microsoft / vscode-textmate

A library that helps tokenize text using TextMate grammars.
MIT License

Textmate engine bug for `\k<>` backreferences #193

Open jeff-hykin opened 1 year ago

jeff-hykin commented 1 year ago

Example of working as expected

Input is on the left, output is on the right.
The "end" pattern is referencing the 2nd group created in "begin" (the EOF)

[Screenshot: should_happen]

What (intentional) failure looks like (non-issue)

A bad pattern causes this kind of behavior: (note: yellow is the theme's color for entity.shell, which is the included pattern)

[Screenshot: bad_pattern]

What is broken

\k<2> should be equivalent to \2, and in other places it does behave equivalently.
However, instead of failing normally (e.g. all-yellow), it seems to trigger undefined behavior: (Note: \2 is not a viable workaround when group numbers are ≥10)

[Screenshot: Screen Shot 2022-12-12 at 3 05 46 PM]

Here's the code for the problematic pattern. This is for VS Code 1.72.2, on Mac M1

{
    "begin": "(<<)\\s*+\\s*+((?<!\\w)[a-zA-Z_][a-zA-Z_0-9]*(?!\\w))(?=\\s|;|&|<|\"|')",
    "end": "\\2",
    "beginCaptures": {
        "1": {
            "name": "keyword.operator.heredoc.shell"
        },
        "2": {
            "name": "string.delimiter.shell"
        }
    },
    "endCaptures": {},
    "name": "string.unquoted.heredoc.no-indent.shell",
    "patterns": [
        {
            "match": ".+",
            "name": "entity.shell"
        }
    ]
}

RedCMD commented 1 year ago

Can confirm. It seems like there are two different points being made:

  1. \\k<2> does not behave the same as \\2 when backreferencing capture groups between begin/end rules. I would think this is a non-issue as it would be annoying to have to count all the capture groups in begin when trying to reference one in end (through the usage of \\k<2>)
  2. an invalid group number in \\k<2> causes the textmate engine to crash. this is more or less consistent with all other textmate errors, e.g. invalid \\g<4> groups. It seems like \\2 inside end has a special property: it does not crash the engine when capture group 2 does not exist, but instead matches against nothing.
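
The second point is what you would observe if the engine rewrote numeric `\N` references in the end pattern with the text captured by begin before compiling it, while passing `\k<N>` through to the regex engine untouched. The sketch below models that idea in Python. `resolve_end_pattern` is a hypothetical helper written for illustration, not the engine's actual code, and Python's `re` stands in for the Oniguruma engine that TextMate grammars really use.

```python
import re

# Matches a numeric backreference such as \2 or \14 in a pattern source.
NUMERIC_BACKREF = re.compile(r"\\(\d+)")


def resolve_end_pattern(end_source: str, begin_match: re.Match) -> str:
    """Hypothetical model: substitute begin captures into the end pattern."""

    def substitute(m: re.Match) -> str:
        index = int(m.group(1))
        # A group that does not exist becomes the empty string.
        if index > begin_match.re.groups:
            return ""
        # A group that did not participate also becomes the empty string.
        return re.escape(begin_match.group(index) or "")

    return NUMERIC_BACKREF.sub(substitute, end_source)


begin = re.compile(r"(<<)\s*([a-zA-Z_][a-zA-Z_0-9]*)")
m = begin.search("cat <<EOF")

print(resolve_end_pattern(r"\2", m))     # EOF
print(resolve_end_pattern(r"\7", m))     # prints an empty line
print(resolve_end_pattern(r"\k<2>", m))  # \k<2>
```

Under this model, `\k<2>` reaches the regex engine as a reference to a group that the end pattern itself never defines, which would explain why it fails differently from `\2`.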

\\h<2> matches against a hexadecimal digit and the literal chars <2> [Screenshot]

> (Note: \2 is not a viable workaround when group numbers are ≥10)

\\14 works fine for me? [Screenshot]

jeff-hykin commented 1 year ago

> \14 works fine for me?

Oh interesting, I suppose (?:\14)4 would be equivalent to \k<14>4 in that case. So there's still a bug, but there's a reliable workaround (which is great news for me).
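
To show why the non-capturing wrapper matters when a digit follows the backreference, here is a small sketch using Python's `re` module. This is only an analogy: Python is not the Oniguruma engine that TextMate grammars run on, and the thread does not establish how Oniguruma itself parses `\144`. The fourteen-group pattern is made up for the demonstration.

```python
import re

# Fourteen single-letter capture groups; group 14 captures "n".
fourteen_groups = "".join(f"({c})" for c in "abcdefghijklmn")

# (?:\14)4 is a backreference to group 14 followed by a literal "4".
workaround = re.compile(fourteen_groups + r"(?:\14)4")
print(bool(workaround.fullmatch("abcdefghijklmnn4")))  # True

# \144 is read by Python as an octal escape for "d" (0o144 == 100).
ambiguous = re.compile(fourteen_groups + r"\144")
print(bool(ambiguous.fullmatch("abcdefghijklmnn4")))  # False
print(bool(ambiguous.fullmatch("abcdefghijklmnd")))   # True
```

Wrapping the reference in `(?:...)` ends the digit run explicitly, so the trailing `4` can never be absorbed into the group number.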

> causes the textmate engine to crash. this is more or less consistent with all other textmate errors

I'd argue that for both \k and \g, either the crash should show up in the debug console, or (if crashing is not an option) the engine should fall back on matching as an empty string. Having it partially highlight the document while silently crashing is what I would consider an issue.
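
A minimal sketch of that proposal, in Python rather than the engine's own code: `compile_or_fallback` is a hypothetical function that reports an invalid pattern and then falls back to a pattern matching the empty string. It does both things at once purely for illustration. Python's `re` happens to reject `\k<2>` outright, which makes it a convenient stand-in for an invalid backreference.

```python
import re
import sys


def compile_or_fallback(source: str) -> re.Pattern:
    """Hypothetical: report an invalid pattern, then match the empty string."""
    try:
        return re.compile(source)
    except re.error as err:
        # Surface the failure instead of swallowing it silently.
        print(f"invalid pattern {source!r}: {err}", file=sys.stderr)
        return re.compile("")


pattern = compile_or_fallback(r"\k<2>")
print(pattern.pattern == "")  # True
```

Either half on its own would already avoid the silent partial highlighting described above.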

> I would think this is a non-issue as it would be annoying to have to count all the capture groups in begin when trying to reference one in end (through the usage of \k<2>)

Many existing syntaxes, like Ruby and Shell, would break if that reference-groups-from-the-start feature never worked. Just because a feature is hard to use manually doesn't make its being broken a non-issue.

> it would be annoying to have to count all the capture groups

I agree, which is why I never count capture groups; I made the Ruby grammar builder do the heavy lifting. Some C++ patterns have over 100 capture groups, so it would've been unrealistic for me to maintain any other way.

[Screenshot: Screen Shot 2022-12-13 at 10 23 23 AM]
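
For readers wondering what letting a tool count the groups could look like, here is a rough sketch. It is not how the grammar builder mentioned above works; `backreference_to` and the fragment list are hypothetical, and the sketch assumes every fragment compiles on its own under Python's `re`.

```python
import re


def backreference_to(fragments: list[str], target: int) -> str:
    """Hypothetical helper: reference the first group of fragments[target]."""
    # Count the capture groups contributed by every earlier fragment.
    preceding = sum(re.compile(f).groups for f in fragments[:target])
    # Wrap in (?:...) so a following digit cannot extend the group number.
    return f"(?:\\{preceding + 1})"


fragments = [r"(<<)", r"\s*", r"([a-zA-Z_][a-zA-Z_0-9]*)"]
print(backreference_to(fragments, 2))  # (?:\2)
```

Because the index is computed from the fragments, adding a capture group to an earlier fragment updates the reference automatically.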