Generate syntax map externally, feed to plugin

alerque commented 4 years ago

New idea. While mucking around with LPEG and various CommonMark implementations for #327, I ran into the pulldown-cmark project. Given I've been learning Rust lately, the speed at which this thing works got my attention.

I know there are plugins out there that use external scripts or binaries to inform their work. For example vim-clap uses a Rust backend as a data provider to its fuzzy search operations. Many other plugins use Python or Lua data providers.

We know we can use a Lua based PEG grammar and feed the syntax highlighter data. What's to stop us from using a much faster backend (and an implementation type the author of CommonMark doesn't think is impossible) that is already CommonMark compliant to generate a source map and use that to inform our plugin?

Obviously asking Pandoc would have been preferable (see #300), but as of yet Pandoc doesn't keep a source map, and pulldown-cmark does. If Pandoc is moving towards CommonMark and this already is 100% CommonMark compliant, is there a reason not to go down this road?

Asking for a friend.

alerque commented 4 years ago

Just as a benchmark for the kind of speed we're talking about, I took the source code for a 100k word book that has about 2k cross-reference links and fed it through pandoc and pulldown-cmark. I'm in quarantine away from my work computer(s) on a CPU that came out in 2008, so not a fast machine.

Pandoc: 12.1 seconds
Pulldown-cmark: 0.038 seconds

I had to run it on loops of 100 to even get a reasonable estimate of the time elapsed, and even then it took me a while to convince myself it wasn't just dumping some cached result and exiting.

fmoralesc commented 4 years ago

Really interesting find! How much do you estimate it would take to extend the parser so it can handle some of the pandoc extensions to commonmark? I was looking at the commonmark spec the other day and it really is very lean. Is the output of pulldown-cmark comparable with pandoc's?

alerque commented 4 years ago

The HTML they generated is not 100% compatible. The pulldown-cmark parser didn't know what to do with my inline SILE code and just output it wrapped in <code> tags. It also didn't do the +smart conversions that Pandoc did by default. Their HTML encoding for footnotes was a bit different, and there are some other differences, but on a basic level they clearly understood the Markdown in roughly the same way.

Right now extending the parser it uses might be a little beyond me. It is pretty low level byte by byte parsing. However the high level idiomatic Rust interface it provides when it is done is a dream to work with.

It does optionally cover more than the CommonMark spec (footnotes, GitHub flavored tables, task lists, strikethrough, etc.), so in theory it should be possible to add more optional extensions. I'm not sure bringing it around to legacy Pandoc flavored markup is in the cards though. Adding extensions for extra features maybe, but changing the existing ones to reflect the flavor variations, probably not.

The main thing that caught my attention is not its rendered output but the combination of the speed plus the easy access to the source map data from a library that would be relatively simple to build our own backend on.

fmoralesc commented 4 years ago

Really, as long as we can get a good representation of the structure of documents, we can supplement anything it's missing on the client side.

fmoralesc commented 4 years ago

We don't even need to call the library, the binary has an option to emit the events and ranges it detects:

24839..24876: Start(Heading(2))
24842..24875: Text(Borrowed("The source/justification question"))
24839..24876: End(Heading(2))
24877..25162: Start(Paragraph)
24877..24953: Text(Borrowed("Having a pre-theoretical sense of the epistemic status of modal opinions, we"))
24953..24954: SoftBreak
24954..25017: Text(Borrowed("are faced with a justificatory task. If our modal opinions are "))
25017..25030: Start(Emphasis)
25018..25029: Text(Borrowed("prima facie"))
25017..25030: End(Emphasis)
25030..25031: SoftBreak
25031..25110: Text(Borrowed("justified, what does or could justify them? And how? Further, if we can come to"))
25110..25111: SoftBreak
25111..25117: Start(Emphasis)
25112..25116: Text(Borrowed("know"))
25111..25117: End(Emphasis)
25117..25161: Text(Borrowed(" modal matters, how do we come to know them?"))
24877..25162: End(Paragraph)

It is possible to process this output and feed vim syntax highlighting with matchaddpos() (we would need to compute the lines though, because the offsets are bytes, but vim provides byte2line()).
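As a rough illustration of the offset conversion mentioned above, here is a minimal Python sketch. It assumes the offsets in the dump are 0-based byte offsets into the UTF-8 source, and that the consumer wants 1-based line and byte-column numbers, which is what matchaddpos() expects. The helper names line_starts and offset_to_pos are made up for this example.

```python
import bisect


def line_starts(source: bytes) -> list:
    """Byte offset at which each line begins (line 1 starts at offset 0)."""
    starts = [0]
    for index, byte in enumerate(source):
        if byte == 0x0A:  # newline: the next line starts one byte later
            starts.append(index + 1)
    return starts


def offset_to_pos(starts: list, offset: int) -> tuple:
    """Map a 0-based byte offset to a 1-based (line, byte column) pair."""
    line_index = bisect.bisect_right(starts, offset) - 1
    return line_index + 1, offset - starts[line_index] + 1


# Hypothetical three-line document; offset 14 is the first "*".
source = "# Title\n\nSome *emphasis* here\n".encode("utf-8")
starts = line_starts(source)
print(offset_to_pos(starts, 14))
```

Building the table of line starts once and searching it with bisect keeps each lookup cheap, which matters if every event range in a long document has to be converted.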

alerque commented 4 years ago

I realize we could parse that text dump (and maybe as a proof of concept it's worth tinkering with), but I think it will be an order of magnitude faster to write ourselves a small library that parses exactly what we need in Rust and spoon feeds it to Lua inside vim in a way that needs as little post-processing as possible to refresh the highlights.

fmoralesc commented 4 years ago

I haven't worked with rust at all, but if you are willing to go in that direction it's worth exploring I think. I can help out with some glue on the vim side. From what I gather from the event dump, it should be enough to get all the Start() event ranges.
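For the proof-of-concept route, a minimal Python sketch of collecting just the Start() ranges from the text dump might look like this. The line format is inferred from the sample output above and is an assumption, not a documented interface; parse_events and start_ranges are hypothetical names.

```python
import re

# Matches dump lines such as "25017..25030: Start(Emphasis)" or
# "25030..25031: SoftBreak" (format inferred from the sample above).
EVENT_LINE = re.compile(r"^(\d+)\.\.(\d+): (\w+)(?:\((.*)\))?$")


def parse_events(dump: str) -> list:
    """Return (start, end, kind, payload) for every recognised dump line."""
    events = []
    for line in dump.splitlines():
        match = EVENT_LINE.match(line.strip())
        if match:
            start, end, kind, payload = match.groups()
            events.append((int(start), int(end), kind, payload or ""))
    return events


def start_ranges(dump: str) -> list:
    """Keep only Start(...) events as (start, end, tag) triples."""
    return [
        (start, end, payload)
        for start, end, kind, payload in parse_events(dump)
        if kind == "Start"
    ]


dump = """\
24877..25162: Start(Paragraph)
25017..25030: Start(Emphasis)
25018..25029: Text(Borrowed("prima facie"))
25017..25030: End(Emphasis)
25030..25031: SoftBreak
24877..25162: End(Paragraph)
"""
print(start_ranges(dump))
```

Since each Start() line already carries the full range of its element, the matching End() lines add nothing for highlighting purposes and can be dropped.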

alerque commented 4 years ago

It looks like there are a couple different architectures we could go with. I'm not sure I can rightly describe them all, much less pick what's best.

An interesting note for now is that there seems to be a way to use RPC calls directly from rust to add/remove highlights in a Neovim instance.

If using the msgpack-rpc interface isn't what we want to do (or we don't want to limit this to Neovim) the other main alternative I see is using mlua to create a module, syntax.so, which can be loaded directly into Lua with syntax = require("syntax") and which provides whatever functions we want to expose. I assume we'd be passing in the buffer contents (or a slice of it?) and get back a list of line/column indexes with what syntax stops or starts there.

fmoralesc commented 4 years ago

> An interesting note for now is that there seems to be a way to use RPC calls directly from rust to add/remove highlights in a Neovim instance.

This is just a version of nvim_buf_add_highlight. A limitation this has is that it doesn't support ranges (so it's a bit more cumbersome to deal with blocks). I wouldn't mind using just the neovim interfaces, but it would limit the potential use of the plugin (I also wouldn't mind, to be honest). On the other hand, it seems like it would be a great way to highlight inline elements.
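To make the "no ranges" limitation concrete: a block that spans several lines has to be broken into one highlight per line. Below is a minimal Python sketch of that split. It assumes 0-based (line, byte column) positions with an exclusive end, and uses -1 as "to end of line", which is my understanding of how nvim_buf_add_highlight treats its col_end argument. The function name split_by_line is hypothetical.

```python
def split_by_line(start: tuple, end: tuple) -> list:
    """Split one range into per-line (line, col_start, col_end) triples.

    start and end are 0-based (line, byte column) pairs; end is exclusive.
    A col_end of -1 stands for "through the end of the line".
    """
    (start_line, start_col), (end_line, end_col) = start, end
    segments = []
    for line in range(start_line, end_line + 1):
        col_start = start_col if line == start_line else 0
        col_end = end_col if line == end_line else -1
        if line == end_line and line != start_line and col_end == 0:
            continue  # the range stopped at the very start of this line
        segments.append((line, col_start, col_end))
    return segments


# A block running from line 2, column 4 up to line 4, column 7.
print(split_by_line((2, 4), (4, 7)))
```

Each resulting triple could then be passed, together with a buffer, namespace and highlight group, to one per-line highlight call.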

The mlua solution might be worth checking out as well.

> I assume we'd be passing in the buffer contents (or a slice of it?) and get back a list of line/column indexes with what syntax stops or starts there.

In neovim we could listen to document change events, and keep something like a shadow version of the documents that could be parsed asynchronously. Slices are difficult because partial markdown might not parse correctly. For example, in this case:

1. item
    2. item
        3. item

vs.

    2. item
        3. item

we might get a list in the first case and a codeblock in the second.
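A minimal Python sketch of the shadow-copy idea, assuming change events arrive as "replace lines first..last (0-based, end-exclusive) with these new lines", loosely modelled on Neovim's buffer update events. The class and method names are hypothetical. The reparse_start heuristic is only an illustration of avoiding slices that begin mid-block: it walks back to an unindented line that follows a blank line, and it would be fooled by things like fenced code blocks containing blank lines.

```python
class ShadowBuffer:
    """In-memory copy of a buffer, kept current from line-range changes."""

    def __init__(self, lines):
        self.lines = list(lines)

    def apply_change(self, first, last, new_lines):
        """Replace lines[first:last] (0-based, end-exclusive) with new_lines."""
        self.lines[first:last] = new_lines

    def reparse_start(self, line):
        """Walk back from `line` to a likely top-level block start.

        Heuristic only: returns the nearest earlier line that is non-blank,
        unindented, and preceded by a blank line (or line 0 if none found).
        """
        for index in range(min(line, len(self.lines) - 1), 0, -1):
            current, previous = self.lines[index], self.lines[index - 1]
            if current.strip() and not current[0].isspace() and not previous.strip():
                return index
        return 0


shadow = ShadowBuffer([
    "Intro paragraph.",
    "",
    "1. item",
    "    2. item",
    "        3. item",
])
# The innermost list item was edited; re-parse from the top of the list.
shadow.apply_change(4, 5, ["        3. edited item"])
print(shadow.reparse_start(4))
```

Starting the re-parse at "1. item" rather than at the edited line keeps the nested items in list context, so they are not mistaken for an indented code block.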

alerque commented 4 years ago

I got a basic module working with mlua such that it exports a function that can receive data from Lua, do something, and send data back. Making Neovim talk to this library has proved to be harder. The module will build for Lua 5.3, 5.2, 5.1, or LuaJIT 2.1 (beta). It will not build for LuaJIT 2.0.5 (https://github.com/khvzak/mlua/issues/3). Of course the Neovim that comes on Arch Linux is compiled against LuaJIT 2.0.5, the last stable release.

Buggers.

What this suggests is that maybe this road leads to insanity. If the plugin has to be compiled against the version of Lua that people's editors were compiled against (i.e. they need matching header files) we're going down a road to purgatory. We'll never come out alive. We might be able to limit ourselves to Neovim, but we can't force people to recompile their Neovim to use a version of Lua we support.

Unless it turns out I'm missing something there, it looks like the RPC route is the way to go. There is some effort to make RPC work with VIM8 too.

We don't have to use the highlight command that the RPC binding exposes (nvim_buf_add_highlight); we should be able to do anything we want on the vimscript / internal Lua side. We just pass data back and forth over an RPC message channel instead of loading a binary library.

bpj commented 4 years ago

I just want to say ~~two~~/three things:

1. I really hope you won't drop support for regular Vim. Neovim has dropped support for perl which I depend on in some cases, the clunky vim/perl interface notwithstanding. TBH mainly my plugin for doing search/replace with Perl (Unicode properties!) regexes, but I depend on that rather heavily. It could probably be rewritten to use Python's regex module and may even benefit from it, but anyway, the regex engine doesn't have all the features of the current perl engine (and vice versa in some cases, to be sure.)
2. Writing a PEG parser (I don't know if you discussed that here or in another issue but anyway) for a syntax as free-form and as whitespace/relative-indentation-dependent as Markdown is no mean task. I have tried. It is possible as jgm/lunamark shows, but no mean task.
3. I have been trying to tweak tpope/vim-markdown for use with termux on my phone where this project is not an option, so I really appreciate what you are trying to do, although there is of course always a risk of creating as many problems as you solve!

(Sorry for the non-links! I haven't been able to figure out how to reference a repo/org rather than an issue!)

alerque commented 4 years ago

@bpj Thanks for the feedback.

1. I hear you. Personally I have no interest in regular VIM; the number of things Neovim improves over that experience for me is just a landslide win. On the other hand I don't plan to pitch it to the wind unless we really have no other choice. We're exploring lots of options here and have not settled on an architecture. But we do have 100+ issues piling up with a glitchy set of syntax rules that nobody feels like fixing because it's a Pandora's box of problems. That can't go on forever. If we can find something we can work with that gets significantly better results (more language coverage, fewer false positives, etc.) then we'll probably go with that even if it means losing VIM8. We won't be disabling the legacy system any time soon so don't panic, but if we can't find a general purpose approach that we want to sink our time into developing we may end up making good things only happen in Neovim.
2. JGM himself, with two attempts at PEG grammars under his belt, is telling me he doesn't think it's possible. He's open to being proven wrong, but that won't be an easy nut to crack!
3. Have you tried the plasticboy/vim-markdown plugin? Also, why do you say this plugin isn't an option in Termux? It should be usable for syntax independent of the main Pandoc plugin (which I understand won't be usable there since nobody has gotten Haskell compilation going in Termux yet).

fmoralesc commented 4 years ago

> I got a basic module working with mlua such that it exports a function that can receive data from Lua, do something, and send data back.

@alerque Did you make more progress? This morning I made a little python module using pyo3, but I didn't get very far because I have no rust experience (I did manage to make the module consume commonmark and spit out html, so it wasn't all loss :p)

alerque commented 4 years ago

I did get somewhere actually. I got the mlua option to build a module against LuaJIT 2.0.5 so I can load it from Neovim's embedded Lua (at least on Arch Linux) and I can pass data back and forth from Neovim -> Rust -> Neovim.

What version of Lua is built into your nvim?

I still suspect the RPC method is the right way to do this, but this is an interesting learning experience for sure.

fmoralesc commented 4 years ago

I'm running Arch too (using the neovim-git package), so 2.0.5.

alerque commented 4 years ago

I'm hacking in the mlua branch for now; you can take a look if you want. Until an upstream issue gets worked out you have to build it on Arch with:

LUA_INC=/usr/include/luajit-2.0/ LUA_LIB=/usr/lib LUA_LIB_NAME=luajit-5.1 LUA_LINK=dynamic cargo build

After building (once) I've just been symlinking the module into the project root directory to load in Neovim:

ln -s target/debug/libvim_pandoc_syntax.so

Then from Neovim:

:lua s = require("libvim_pandoc_syntax")
:lua print(s.render_html("some *markdown* string"))

Messing with passing other data back now. I'm in Gitter if you're around.

bpj commented 4 years ago

@alerque Vim-pandoc works in Termux but it uses a lot more storage than I feel I can spare just to get some more bells and whistles. I use it on my tablet where I've got more storage.

So JGM deems his PEG markdown parsers to be failures? Doesn't really surprise me!

I've been thinking all day about reimplementing my Perl search/replace plugin in Python. If you drop regular Vim I'll probably do it! However I also found out that the Termux ~~vim-python~~ nvim build barfs on plugins using python3. That will probably change soonish though since py2 has been officially declared dead.

bpj commented 4 years ago

Sorry I meant to say that the termux nvim build barfs on plugins requiring python 3 while the vim-python build doesn't. I haven't checked on my laptop yet but I imagine nvim is/can be built with py 3 support there.

As for running Pandoc in Termux, last I heard Pandoc doesn't build on ARM.

alerque commented 4 years ago

> Neovim has dropped support for perl

@bpj This didn't sound right to me. I looked into it and the Perl snafu seems to have just been a case of temporary breakage during refactoring. Lots of things didn't work when they were first bringing up the basics after the great dead code purge, and getting Perl back online seems to have taken longer than most features. The good news is Perl is back on the menu (or will be in the next release if you're on stable release packages; it works for me now in the current Git master).

bpj commented 4 years ago

@alerque a small update on this:

I can now use plugins which use python3 (but not python2) in nvim on termux (pip3 install neovim :-) so no worries on that front.

I still can't use Perl in either vim or nvim on termux simply because their builds of both are done without Perl, and it doesn't seem like they are going to change their mind! :)

So it seems I'll have to do that rewrite of my regex substitution plugin using the python regex module. Makes an old Perl hacker's heart bleed but seems like a good way of getting my feet wet in writing python plugins.

vim-pandoc / vim-pandoc-syntax

Generate syntax map externally, feed to plugin #332