Tree-sitter grammar for syntax highlighting

TristanCacqueray commented 2 years ago

Is your feature request related to a problem? Please describe.
The syntax highlighting support seems a bit fragile. While it seems to work, I wonder if it can be improved by using a tree-sitter grammar.

Describe the solution you'd like
A tree-sitter grammar to be added to the swarm project by following https://tree-sitter.github.io/tree-sitter/creating-parsers. Then it should be integrated in https://github.com/emacs-tree-sitter/tree-sitter-langs/tree/599570cd2a6d1b43a109634896b5c52121e155e3/repos. For vim, swarm can provide a sample configuration to load the grammar: https://github.com/nvim-treesitter/nvim-treesitter#adding-parsers.

Describe alternatives you've considered
https://www.masteringemacs.org/article/tree-sitter-complications-of-parsing-languages mentions CEDET, but that seems to be superseded by tree-sitter.

byorgey commented 2 years ago

I hadn't heard of tree-sitter before. Looking into it a bit, it looks cool, but I don't understand what the benefit would be as compared to just implementing more of the LSP protocol, to give editors semantic information about tokens which they can use for syntax highlighting. Indeed, the tree-sitter plugin for VSCode appears to be deprecated for exactly this reason.

To be more blunt, maintaining a separate parser in whatever custom grammar description language tree-sitter uses, and having to keep it up-to-date every time we change the swarm-lang syntax, sounds absolutely awful. I would much, much, much rather get nice syntax highlighting via LSP, which means we can just piggyback on the existing Haskell parser for swarm-lang.

TristanCacqueray commented 2 years ago

Good points, though according to the masteringemacs article linked above, it seems like LSP is a poor fit for syntax highlighting. I guess it's worth a try.

I agree it's awful to duplicate the work, but we are kind of already doing this for emacs with regex, and vscode with textmate. Perhaps using tree-sitter as a drop-in replacement is not as bad as it sounds, and we could get vim support for free.

For vscode, it seems like https://github.com/microsoft/vscode-anycode is the extension that leverages tree-sitter.

byorgey commented 2 years ago

Ah, OK, that makes sense. To summarize, some of the main reasons the article gives for why tree-sitter performs much better than LSP:

tree-sitter is built to be incremental and to work well even when the file has a syntax error. Our Haskell parser for swarm-lang does not do this at all (and making it so would be a lot of work).
Sending every keystroke to LSP and getting back syntax highlighting info is a lot of interprocess chatter and slows things down considerably.

I still don't really like the idea of having to maintain two separate parsers in parallel, but I'm open to the possibility that it might be worth it.

Edited to add: Though I note that even https://github.com/microsoft/vscode-anycode says "This extension should be used when running in enviroments that don't allow for running actual language services."

byorgey commented 2 years ago

https://www.masteringemacs.org/article/tree-sitter-complications-of-parsing-languages seems to have been written 3 or 4 years ago; I would be curious to learn what (if anything) has changed since then.

Some more comparisons between tree-sitter and LSP, from 2018: https://news.ycombinator.com/item?id=18349488
But LSP has had a new semantic highlighting protocol since late 2020 or so: https://github.com/microsoft/language-server-protocol/issues/18
Here is a more recent detailed discussion of tree-sitter vs LSP: https://github.com/nvim-treesitter/nvim-treesitter/issues/484
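
For context on the semantic highlighting protocol mentioned above: as far as I understand the LSP specification, semantic tokens are sent as a flat integer array with five integers per token (delta line, delta start character, length, token type index, modifier bitset), each position relative to the previous token. A minimal sketch of that encoding follows. The token list and the type legend are made up for illustration and are not taken from swarm's language server.

```python
from typing import List, Tuple

# (line, start_char, length, token_type_index, modifier_bitset), absolute positions.
Token = Tuple[int, int, int, int, int]


def encode_semantic_tokens(tokens: List[Token]) -> List[int]:
    """Encode tokens in the relative five-integers-per-token layout."""
    data: List[int] = []
    prev_line, prev_start = 0, 0
    for line, start, length, token_type, modifiers in sorted(tokens):
        delta_line = line - prev_line
        # The start column is relative to the previous token only when
        # both tokens are on the same line; otherwise it is absolute.
        delta_start = start - prev_start if delta_line == 0 else start
        data.extend([delta_line, delta_start, length, token_type, modifiers])
        prev_line, prev_start = line, start
    return data


# Hypothetical tokens for a two-line snippet. The type indices refer to a
# legend the server would declare (assumed here: 0 = keyword, 1 = function).
example = [(0, 0, 3, 0, 0), (0, 4, 4, 1, 0), (1, 2, 4, 1, 0)]
encoded = encode_semantic_tokens(example)
```

The point for this discussion is that the server still has to produce this array from a full parse, so the approach depends on the parser coping with partially edited files.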

byorgey commented 2 years ago

What would writing a tree-sitter grammar for Swarm look like? A few thoughts/notes from https://tree-sitter.github.io/tree-sitter/creating-parsers#writing-the-grammar :

"in order to produce an easy-to-analyze tree, there should be a direct correspondence between the symbols in your grammar and the recognizable constructs in the language" -- this should be really easy: we just make the grammar correspond to the structure of the Term algebraic data type.
"A close adherence to LR(1)" -- this might be more difficult. There are a few places where we use try, which may cause issues, such as parsing a noop {} and in parseStmt.
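
To illustrate the noop {} case mentioned above, here is a rough sketch of how a parser can choose between an empty-braces noop and a braced block by peeking at one token instead of backtracking. The token representation, the function name parse_braces, and the tuple-shaped result are all hypothetical. This is not how the Haskell parser for swarm-lang is written.

```python
from typing import List, Tuple


def parse_braces(tokens: List[str], pos: int = 0) -> Tuple[tuple, int]:
    """Parse '{' '}' as a noop, or '{' item* '}' as a block.

    The decision is made with one token of lookahead after '{',
    so no backtracking is needed. Returns (node, next_position).
    """
    if pos >= len(tokens) or tokens[pos] != "{":
        raise SyntaxError("expected '{'")
    pos += 1
    # Lookahead: an immediate '}' means this is the noop form.
    if pos < len(tokens) and tokens[pos] == "}":
        return ("noop",), pos + 1
    body: List[str] = []
    while pos < len(tokens) and tokens[pos] != "}":
        body.append(tokens[pos])
        pos += 1
    if pos == len(tokens):
        raise SyntaxError("unclosed '{'")
    return ("block", body), pos + 1
```

For example, parse_braces(["{", "}"]) yields a noop node, while parse_braces(["{", "move", "}"]) yields a block containing "move".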

xsebek commented 1 year ago

@byorgey actually, is Swarm an LR(1) language? AFAIK the noop and operator (+/++) cases can be resolved with one-character lookahead.
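
As a rough illustration of the operator case, a hypothetical lexer (not Swarm's actual parser) can tell + from ++ by peeking one character ahead. Note that this is lookahead at the character level inside the lexer, which is a separate matter from the one-token lookahead that LR(1) refers to.

```python
from typing import List


def lex_plus_operators(src: str) -> List[str]:
    """Split a string into '+' / '++' operators and other words,
    using a single character of lookahead (longest match wins)."""
    tokens: List[str] = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == "+":
            # Peek at the next character to decide between '+' and '++'.
            if i + 1 < len(src) and src[i + 1] == "+":
                tokens.append("++")
                i += 2
            else:
                tokens.append("+")
                i += 1
        else:
            # Group any other run of non-space, non-plus characters.
            j = i
            while j < len(src) and not src[j].isspace() and src[j] != "+":
                j += 1
            tokens.append(src[i:j])
            i = j
    return tokens
```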

byorgey commented 1 year ago

I am not sure. I think so. But it's been a long time since I thought about various grammar classifications.

swarm-game / swarm

Tree-sitter grammar for syntax highlighting #323