[LaTeX] LaTeX Syntax Development

ngc92 commented 1 year ago

I'm planning to do some more work on the (La)TeX syntax. I've opened this issue to make sure that general discussion does not get lost after a particular PR is merged.

First, some general info (disclaimer: I'm no expert; this may not be entirely accurate): TeX is a macro programming language, and works by replacing tokens until only primitive things remain. This process is highly configurable from within the language itself, so that in principle the user can change the available syntax almost completely. Consequently, we cannot make a useful syntax highlighter for the actual language, but only for "idiomatic usage".

Based on this, people have made predefined sets of macros (formats) for end-user "convenience" (necessary unless you're an expert, I would say). Popular ones are

LaTeX document - Probably best considered an extension of (Plain)TeX.
BiBTeX - Looks quite different. Already supported by Sublime.
ConTeX - An alternative to LaTeX, also an extension of plain TeX. Haven't used it myself. Currently no Sublime support.
Expl3 - The new, internal programming layer for LaTeX. May be embedded in regular LaTeX, though typically for package implementers. Also an extension of TeX, I think.
LaTeX package - When implementing reusable packages, there are different structural commands, and it would probably make sense not to slow the syntax highlighting down with the huge amount of typesetting-related commands.
TikZ/PGF- Maybe best considered an embedded DSL for specifying graphics within a (La)TeX document.
dtx - "Documented (La)TeX" - a rather weird format, in which most of the programming is actually written inside TeX comments, which makes the current syntax highlighting pretty useless there. Many LaTeX packages are distributed in this way.
ins - Installer files for LaTeX packages. Also use TeX syntax, with a few extra commands available.

Then, of course, there are some highly popular packages, some of which are already supported by the current syntax.

I don't think I'm familiar enough with the details of sublime's syntaxes yet to start any big projects, so for the time being I'll try my hand at smaller improvements.

Things I'm planning to do for now:

add let, futurelet to TeX (https://en.wikibooks.org/wiki/TeX/let)
add \NewDocumentCommand and similar to LaTeX
Extract common (parts of) match patterns into variables
add more TeX primitives and LaTeX macros to be recognized. Without special syntax support, just mark them as "built-ins".

For the last point to make sense, we need to re-think the scoping of macros. Currently, all unknown macros are scoped as support.function. This is different from other languages, I think where anything unknown would be a variable.function, and only those actually recognized would get special treatment.

For TeX primitives, a list is given here: https://en.wikibooks.org/wiki/TeX#TeX_Primitives For LaTeX, this should serve as a decent command reference: https://tug.org/texinfohtml/latex2e.html

Here are two example files to test the TeX highlighting:

The macro set that Knuth used for typesetting the TeX manual. http://mirrors.ctan.org/macros/plain/base/manmac.tex
And the bootstrapping macro set for (plain)TeX itself. Since this is partly defining the syntax itself, we cannot really expect to get complete and useful scoping, but we should be able to get most parts. http://mirrors.ctan.org/macros/plain/base/plain.tex

deathaxe commented 1 year ago

Your efforts are wellcome and appriciated.

All of us started with small projects and learned our lessons on the way going. Your PRs look pretty well so far and I am sure we can guide you with the barly documented "standards" and best practices to help you learning - if you can set some pacience aside for our (my) maybe nagging questions :-)

Writing syntax definitions is not the easiest part as it requires both, a good understanding of how the syntax intself and the syntax engine work.

LaTeX syntax family hasn't been touched for a while and it may not be up-to-date with current best practices with regards to scoping. So there's probably a lot of room for improvements. Your suggestion with regards to variable.function makes absolutely sense.

If LaTeX is really an extension of TeX, we could/should probably implement all common parts in TeX and re-use them in LaTeX by extending it on TeX. This is what we did with HTML (Plain) and HTML for instance. While the former implements all general rules/patterns the latter one contains some more detailed and specific parts. I could imagine the same here to avoid duplicated implementations.

We probably can't make perfect syntax definitions which cover all details and special cases - even so some syntaxes are already pretty near to it. But as you already revealed - it depends on the syntax's nature. So it is very ok to support the basic parts only. Once you get used to sublime-syntax format, the more sophisticated ideas often come to mind on their own.

ngc92 commented 1 year ago

One more question is the scoping regard text or source scopes. Right now everything is specified as text.tex and text.latex, but that doesn't quite describe the reality. In fact, the TeX syntax specifies as file types .sty style and .cls class files, neither of which by themselves are designed to produce any text, but instead provide commands for another document to use for typesetting. And for LaTeX, the text part is also only between \begin{document} and \end{document}, though a .tex file might also be included into another one, so this is not always clear from the file alone.

deathaxe commented 1 year ago

I think main scopes have originally been defined by TextMate. I guess the idea behind may have been TeX/LaTeX to be some kind of markup language whose main goal is to produce printable documents. Hence text may have been choosen - even though a type setting language is much more like that.

Without an idea about all the macros and modules I would probably have choosen text as for Markdown or reStructuredText. But I've only studied some simple examples about how to create simple documents.

Technically ST treats source and text documents differently with regards to word wrapping auto-completions and maybe more aspects. Automatic word wrapping would be a technical argument I'd probably choose to argue for text as well. But nothing is black or white - as usual.

The one way or the other, I'd hesitate to change the main scope of a syntax at this point without compelling reasons.

ngc92 commented 1 year ago

Does it make sense to have both options? This somewhat corresponds to the "LaTeX document" and "LaTeX package" options I mentioned above. For example, for style and class files, we assume source.(la)tex and for .tex files we have tex. Because a style file really is very different from something like Markdown or reStructuredText, for example see here: https://github.com/ArmageddonKnight/NeurIPS/blob/master/neurips.sty

incidentally, that example also shows a case where both github's and sublime's current syntax highlighter go wrong. These constructs

\newif\if@natbib\@natbibtrue

define a new command of the name if@natbib, but the regex sees @ as the end of a word, whereas in (programming) tex it is often considered a letter.

deathaxe commented 1 year ago

We could think about a LaTeX (Package) syntax with a source main scope, which extends TeX or LaTeX and is bound to sty extensions if package syntax requires special treatment. I probably wouldn't choose that as a first step though, but create a stable base TeX syntax first, which could be extended to create other dialects. It increases chances to reuse things and reduce duplicated contexts.

ngc92 commented 1 year ago

We could think about a LaTeX (Package) syntax with a source main scope, which extends TeX or LaTeX and is bound to sty extensions if package syntax requires special treatment. I probably wouldn't choose that as a first step though, but create a stable base TeX syntax first, which could be extended to create other dialects. It increases chances to reuse things and reduce duplicated contexts.

This is more-or-less what I had in mind.

Regarding the general structure of the TeX-like syntaxes, would it make sense to condense the main context by adding something like the following:

command-sequence-escapes: 
  - match: `(?=\\)`:
    push: command-sequence

and then have command-sequence do the actual dispatch to the different commands we highlight? This should improve performance, right, as otherwise we'll re-check for the backslash in each context that can potentially appear.

Ideally, I'd like to also move the scoping of the escaping backslash into one place, but I'm not sure if that is possible with the current scope setup (i.e. the backslash being part of the specific scope we apply to the command).

deathaxe commented 1 year ago

I have mixed experiences with regards to such lookahead stuff. Sometimes performance is better, sometimes not. Probably depends on how likely such expressions are in real code. In general ST is quite good in compiling its optimized regexp pattern for each context.

Grouping it would probably be the last step if performance is an issue. Otherwise it may just add extra complexity, which is probably not worth it.

... I'd like to also move the scoping of the escaping backslash into one place...

Beyound your observations it would also cause meta scopes of pushed contexts not to cover the backslash, which is probably not ideal. The current way is therefore the ony way to go without breaking meta scope boundaries etc.

ngc92 commented 1 year ago

I've thought some more about the required context and scoping for TeX (mostly for #3587, but this applies more broadly), and I'm not sure what is the best way forward. The flexibility of TeXs syntax really makes it difficult to handle this reliably with simple contexts, I think. As newlines are generally treated as any other whitespace, we could have -\n1 and -\n1.5 as numbers. That means that, by the time we read and scope the minus sign, we don't know yet if the scope should be integer or float. I guess it is possible to handle this using branch contexts, but that makes things more complex.

I've taken a look how this is handled in C, but it seems that C does not actually follow the guidelines, the minus sign is not scoped as part of the meta.number. I've also managed to create this example where the C highlighter completely fails. I'll post this as a separate bug report.

deathaxe commented 1 year ago

We should probably not be too eager with such uncommon edge cases. Even if a compiler/interpreter can handle such things, it is not common practice and very unlikely someone implements it like that.

C family is a) very outdated and needs a rewrite and b) is a syntax which it is impossible for to reliably detect whether a - is a sign or an operator. This is exactly why keyword.operator was choosen rather constant.numeric.sign.

zepinglee commented 1 year ago

For TeX primitives, a list is given here: https://en.wikibooks.org/wiki/TeX#TeX_Primitives

Overleaf has a more complete list at https://www.overleaf.com/learn/latex/TeX_primitives_listed_by_TeX_engine. This list includes eTeX primitives like \protected and \detokenize.

sublimehq / Packages

[LaTeX] LaTeX Syntax Development #3531