tree-sitter-grammars / tree-sitter-markdown

Markdown grammar for tree-sitter
MIT License
375 stars 45 forks source link

MDX Support #81

Open panzacoder opened 1 year ago

panzacoder commented 1 year ago

Surprised this hasn't been mentioned here before, but has anyone considered adding MDX support for this parser? I understand it may take some consideration to support MDX fully with JSX syntax recognition, but it would be really nice if this just also supported files with the .mdx extension.

Thoughts?

pappasam commented 1 year ago

This would be amazing

MDeiml commented 1 year ago

I guess half of this is already kind of implemented (highlighting-wise) since HTML tags should be recognized as such in most situations. The rest (import/export/statements in curly braces) also seem to be quite well defined and easy to implement. Also I already made this parser configurable using compile time environment variables, as such this shouldn't require forking the parser.

All that saying that everything is in place to implement this without too much effort. I don't think this will be a priority for me anytime soon as I don't have too much time to spend on this parser, but I will gladly help anybody trying to add this functionality.

olrtg commented 1 year ago

Hey! I was looking into how to extend this parser in a new project but it seems we can do this here.. I'm new to this whole tree-sitter parser authoring.. but I'm willing to learn and work on this if @MDeiml help me with this (mostly answering questions and pointing to the right resources). How can we move forward with this? I'm honestly surprised that MDX is not supported outside vscode since is quite popular.

MDeiml commented 1 year ago

First of all I suggest we settle on a spec for .mdx file which seems to be this. For a coarse overview on how this parser is structured see the Code Overview section in CONTRIBUTING.md.

There are two syntactical elements, that need to be implemented on top of this parser:

  1. import / export statements (these are part of the "block" parser)
  2. expressions in curly braces (these are part of the "inline" parser)

For 1. most of the code needs to go in tree-sitter-markdown/src/scanner.cc. import / export statements should work similar as "atx headings" in the sense that they are leaf blocks (they can't contain any other blocks) and they are recognized by a keyword at the beginning of the line. I recommend that you mainly copy the code for atx headings in tree-sitter-markdown/src/scanner.cc and tree-sitter-markdown/grammar.js and take it from there.

For 2. most of the code needs to go in tree-sitter-markdown-inline/src/scanner.cc. The curly braces work similar to "latex blocks" because they are also leaf elements (can't contain other inline elements) and are recognized by their opening and closing delimiters. Again I would recommend copying the code for latex blocks in tree-sitter-markdown-inline/src/scanner.cc and tree-sitter-markdown-inline/grammar.js and taking it from there.

You can test your code by adding files to tree-sitter-markdown/corpus and tree-sitter-markdown-inline/corpus respectively and running npm test.

Finally, mdx should probably be treated as an optional extension, so you can have a look at the beginning of common/grammar.js on how extensions are treated, but you can leave that until later.

(I'd like to emphasize again that you should mainly copy code and modify it to your needs. There are some intricacies to this parser that stem from the restrictions tree-sitters parsing model and that lead to some pretty weird and ugly code, but I'd like to avoid getting into that.)

Sry for the wall of text. If you have any more questions feel free to ask :)

olrtg commented 1 year ago

Thanks for the quick reply! Yeah they seem to have a specification in a github repo.. I think we can use the link you've provided plus the mentioned link to work on this.

I'll checkout the structure of the project in detail after work. Will reach again soon, probably with some questions.

olrtg commented 1 year ago

Hello! I think I'm starting to understand the workflow with this project in particular (and treesitter in general).. so I have a couple o questions just to confirm:

  1. You want me to follow the same code pattern that you have been using in this project which it seems to be (at least for the majority of rules) to handle the tokens with a external scanner.. so i just need to clone a leaf block in the grammar.js file for creating these new tokens (for example import/export) but do the actual parsing in the scanner.cc file without using the treesitter grammar rules..is that correct?
  2. The import block AST has for example a import_statement which contains a list of named_imports and then the source (will leave a screenshot).. meaning it seems complex (to me at least).. you want me to copy/paste the code for this tokens from the jsx parser or should I just import the parser and reuse those tokens?

image

Thanks for your time again!

MDeiml commented 1 year ago
  1. Yes, sorry I should have specified that. You are correct, most of the parsing sadly has to be done in the external scanners, so what you are describing is what I meant. (Be careful that the external tokens in grammar.js and scanner.cc have the same order.)
  2. Actually you don't have to implement the "finer" structure of import / export statements in this parser. Instead you can capture the whole statement and then leave it to the js grammar to figure out the structure through "language injections" (the import statement as a whole is valid js syntax). For this you will need to edit tree-sitter-markdown/queries/injection.scm, in there you can see for example how this parser similarly does not actually care about the finer structure of html and leaves it to the html grammar to figure it out.

    Language injections are not really part of tree-sitter, but rather a best practice that's implemented in probably all editors that use tree-sitter, see also this section in the docs.

olrtg commented 1 year ago

Hey! Sorry I took some time off from personal projects. Yesterday I came back to it but I think is just hard for me to learn all at once (treesitter and specially c) from an existing project. I'm going to create a new project and try to extend this parser with the grammar.js file. And if I'm successful I will try to add the mdx support in this repo directly in c. Thanks again for your time!

olrtg commented 1 year ago

Did you consider publishing this as an npm package? I see that some parsers do this and then other parsers use them (with npm install) to extend the parser. It'll be great for my use case 😄

MDeiml commented 1 year ago

Sure, I could try, though I'm not familiar. Could you give me an example for a parser that does that?

bennypowers commented 1 year ago

Please consider doing this as a separate parser. MDX is a separate language with different syntax and semantics. We do a lot of html-in-markdown at work, and I'm concerned that layering mdx on top of the existing semantics will break things in subtle ways.

paulkre commented 8 months ago

Neovim's default Markdown syntax highlighting is currently better at handling MDX content than the Treesitter version