regex for linguist grammar compiler

jesslin02 commented 1 year ago

when attempting to add support for lflang to linguist, the grammar compiler detects these issues with the lflang.tmLanguage.json file:

- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`;|(?<=}.*)$|(?<=target[^\{]*)$`": lookbehind assertion is not fixed length (at offset 9))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(^|,|\{)(\s)*)\b([\w\-]+)\b(`...": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\s*=\s*new\s*(\[[^\]]*\])?\s`...": lookbehind assertion is not fixed length (at offset 33))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\w+((\.)|(::)))(\b\w+\b)`": lookbehind assertion is not fixed length (at offset 18))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\breactor\b\s*)(?!at)(?!exte`...": lookbehind assertion is not fixed length (at offset 18))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\bmethod\b\s+)\w+\b`": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 34))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 39))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(\\.|[^"\\])*)("|$)`": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(\\.|[^'\\])*)('|$)`": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 39))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 39))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 34))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`;|(?<=}.*)$|(?<=import[^\{]*)$`": lookbehind assertion is not fixed length (at offset 9))

lhstrh commented 1 year ago

The problems that Linguist reports on have to do with the fact that the grammar is trying to do a bit more than just tokenization, which it probably shouldn't do. Essentially, all that this grammar is supposed to do is recognize, based on Oniguruma regular expressions, tokens in a stream of text. In our context-free grammar (https://grammar.lf-lang.org/) you can find these regular expressions also. While TextMate files are often are referred to as a "grammar", they really are not grammars, because most programming languages are context-free, meaning they can't be parsed merely using regular expressions.

In other words, a simplification of the lflang.tmLanguage.json is in order, meaning that some of the more general things that happen in it might have to move into the TypeScript code. Would you be interested in taking a stab at this, @jesslin02?

petervdonovan commented 1 year ago

I started writing at the same time as Marten. I'll still just post this anyway to try to fill in some additional details:

Certain regex features are not universally supported, probably because they are hard to implement efficiently. Apparently non-fixed-width lookbehind is one such feature.

Code highlighting in VS Code is divided between syntactic highlighting and semantic highlighting. Both syntactic and semantic highlighting are performed in a way that is a little bit wrong using code that is completely separate from the parser that is used in the compiler -- this is acceptable because it almost always works properly, and because highlighting is not a very hard requirement. The syntactic highlighting is handled by TextMate; the semantic highlighting is handled by highlight.ts. There isn't a very sharp distinction between syntactic and semantic highlighting, but Marten says that the job of the TextMate grammar should be limited to tokenization.

We have a few options then:

Remove the offending rules (which use non-fixed-width lookbehind) from the TextMate grammar and put them in highlight.ts instead. Note that this will make the highlighting on our website and by linguist somewhat less rich because neither use highlight.ts.
Patch up the offending rules, but keep them in the TextMate grammar. This will likely make the TextMate grammar a little bit more wrong than it already is; we get to decide what level of wrongness is acceptable to us.

In any case, changes that we make to work around this issue must be tested by visually assessing the highlighting of example LF programs (we don't have regression tests for this). It is wise to use the scope inspector to make sure things work as you think.. Tokens that are highlighted by the TextMate grammar will have scope names that end in ".lflang". We need to avoid significant regressions not only in overall highlighting, but also in highlighting that is just performed by the TextMate grammar. You can also disable semantic highlighting by commenting out this, just to make sure that things still look OK without it.

lhstrh commented 1 year ago

These are great suggestions. Thanks, @petervdonovan!

jesslin02 commented 1 year ago

thanks @petervdonovan and @lhstrh! do you have any recommendations for example LF programs for visual tests? i've been finding some specific example lines that match the TextMate scope that i would like to modify by trial and error, but i'm not sure if there's a more comprehensive way to go about it.

lhstrh commented 1 year ago

I think that if you focus on unit tests you can get away with using very small code fragments that hit those lines. You don't need complete programs that are able to execute or do anything meaningful.

jesslin02 commented 1 year ago

resolved through changes in #104

lf-lang / vscode-lingua-franca

regex for linguist grammar compiler #92