tree-sitter-grammars / tree-sitter-markdown

Markdown grammar for tree-sitter
MIT License

feature: pandoc raw_tex and raw_attribute support #145

Closed (anuramat closed 5 months ago)

anuramat commented 5 months ago

Did you check the tree-sitter docs?

Is your feature request related to a problem? Please describe.

Pandoc has a few handy extensions that allow embedding raw content in the output document. It would be nice to support these, e.g. for highlighting.

  1. Raw attributes

    Inline spans and fenced code blocks with a special kind of attribute will be parsed as raw content with the designated format. ...

    ```{=latex}
    \begin{tabular}{|l|l|}\hline
    Age & Frequency \\ \hline
    18--25  & 15 \\
    26--35  & 33 \\
    36--45  & 22 \\ \hline
    \end{tabular}
    ```

  2. Raw TeX

    ... pandoc allows raw LaTeX, TeX, and ConTeXt to be included in a document. Inline TeX commands will be preserved and passed unchanged to the LaTeX and ConTeXt writers. ...

    \cite{jones.1967}
    \begin{tabular}{|l|l|}\hline
    Age & Frequency \\ \hline
    18--25  & 15 \\
    26--35  & 33 \\
    36--45  & 22 \\ \hline
    \end{tabular}

Describe the solution you'd like

  1. Language injection for fenced blocks with raw attributes, just like with regular fenced code blocks
  2. TeX injection for LaTeX environments (\begin{}-\end{} blocks); I'm not sure if supporting other *TeX commands would be feasible.

Describe alternatives you've considered

No response

Additional context

https://pandoc.org/MANUAL.html#extension-raw_attribute

https://pandoc.org/MANUAL.html#extension-raw_tex

MDeiml commented 5 months ago

Feature 1 should be quite doable; it would just require improving the rules for parsing the language here:

https://github.com/tree-sitter-grammars/tree-sitter-markdown/blob/7fe453beacecf02c86f7736439f238f5bb8b5c9b/tree-sitter-markdown/grammar.js#L186-L196

I personally will probably have no capacity to work on this though.
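To make the idea concrete, here is a rough Python sketch of the mapping such a rule would need: given the info string of a fenced block, return the language to inject. This is not the grammar's actual implementation (that lives in `grammar.js`); the function name `injection_language` and the exact character set accepted inside `{=...}` are assumptions for illustration only.

```python
import re
from typing import Optional

# Assumed shape of a pandoc raw attribute: "{=format}", e.g. "{=latex}".
# The set of characters allowed in the format name is a guess.
RAW_ATTRIBUTE = re.compile(r"^\{=([A-Za-z0-9_+-]+)\}$")


def injection_language(info_string: str) -> Optional[str]:
    """Return the language named by a fenced code block's info string.

    Hypothetical helper: handles a raw attribute like "{=latex}" as well as
    a plain info string like "latex", where the first word is the language.
    """
    info = info_string.strip()
    if not info:
        return None  # no info string, nothing to inject
    match = RAW_ATTRIBUTE.match(info)
    if match:
        return match.group(1)  # "{=latex}" -> "latex"
    return info.split()[0]  # "python title=x" -> "python"
```

With this mapping, `{=latex}` and `latex` resolve to the same injected language, which is all that highlighting needs; how pandoc treats the block in its output is a separate matter.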

Feature 2 would be quite hard, since it is both hard to detect (one would need to know whether there is an end block before it is clear whether something is LaTeX) and collides with other grammar rules.
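The detection problem can be illustrated with a small Python sketch. It is not how a tree-sitter scanner works, and `find_environment_end` is a hypothetical helper; it only shows that deciding whether a `\begin{...}` line starts a LaTeX environment requires looking ahead, possibly to the end of the document, for the matching `\end{...}`.

```python
import re
from typing import List, Optional

# A line that opens an environment, e.g. "\begin{tabular}{|l|l|}\hline".
BEGIN = re.compile(r"^\s*\\begin\{([A-Za-z*]+)\}")


def find_environment_end(lines: List[str], start: int) -> Optional[int]:
    """Return the index of the line closing the environment opened at `start`.

    Returns None if lines[start] does not open an environment or if no
    matching \\end{...} is found, in which case the text would have to be
    treated as ordinary Markdown.
    """
    match = BEGIN.match(lines[start])
    if match is None:
        return None
    name = match.group(1)
    begin_token = "\\begin{" + name + "}"
    end_token = "\\end{" + name + "}"
    depth = 0
    # Unbounded lookahead: scan until nesting of this environment returns to 0.
    for index in range(start, len(lines)):
        depth += lines[index].count(begin_token)
        depth -= lines[index].count(end_token)
        if depth == 0:
            return index
    return None
```

The answer for the first line depends on text arbitrarily far below it, which is awkward for a parser that wants to classify a block as soon as it starts.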

clason commented 5 months ago

Feature 1 already works, though, by virtue of injections? You just have to use proper language annotations:

```latex
\begin{tabular}{|l|l|}\hline
Age & Frequency \\ \hline
18--25  & 15 \\
26--35  & 33 \\
36--45  & 22 \\ \hline
\end{tabular}
```

anuramat commented 5 months ago

Your example would be emitted as a (visible) code block, whereas the one with {=latex} would be inserted into the actual LaTeX output as LaTeX code or, if the output format is e.g. PDF, rendered (in this case as a table).

clason commented 5 months ago

That's completely out of scope for a Markdown tree-sitter parser, sorry. I would recommend creating your own pandoc parser that extends this one (grammars can inherit others).

(We've had other issues with forcing pandoc files as markdown, so I'd prefer to treat this as a separate filetype from here on out.)