microsoft / vscode-textmate

A library that helps tokenize text using TextMate grammars.
MIT License

Nested `begin`/`while` rules do not continue past begin match #209

Closed dawsonc623 closed 1 year ago

dawsonc623 commented 1 year ago

Given the following grammar and input, the top-level begin/while rule applies as expected (foo in the example), but the nested while rules do not seem to work. The second level's begin (bar in the example) matches, as confirmed by the token inspector, but its while does not seem to apply: none of the applicable lines receive the expected scope (meta.bar in the example), and the next level (baz in the example) is never applied at all (not even its begin matches).

Grammar:

{
  "name": "Foo",
  "scopeName": "source.foo",
  "patterns": [
    { "include": "#foo" }
  ],
  "repository": {
    "foo": {
      "name": "meta.foo",
      "begin": "^(\\s*)foo.*",
      "while": "^(\\1\\s+.*)",
      "whileCaptures": {
        "1": {
          "patterns": [{ "include": "#bar" }]
        }
      }
    },
    "bar": {
      "name": "meta.bar",
      "begin": "^(\\s+)bar.*",
      "while": "^(\\1\\s+.*)",
      "whileCaptures": {
        "1": {
          "patterns": [{ "include": "#baz" }]
        }
      }
    },
    "baz": {
      "name": "meta.baz",
      "begin": "^(\\s+)baz.*",
      "while": "^(\\1\\s+.*)"
    }
  }
}

Input:

  foo
    bar
      baz
      baz
      baz
    bar
      baz
      baz
  biz
  foo

Note, my end goal is to process a whitespace-sensitive language where some rules only apply to lines nested "within" certain sections. In the example, bar's rules are only applicable within foo sections. A foo section is triggered by a line whose first non-whitespace characters are foo, and only the following lines that start with the whitespace preceding foo on the begin line plus at least one more whitespace character are considered part of that section (ergo, the biz line is "nothing" in the given input and grammar). The same nesting relationship exists between bar and baz, and theoretically more nesting relationships (including cyclical ones; foo inside of baz, for example) could exist ad infinitum.
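
To make the membership rule concrete, here is a minimal Python sketch of it, independent of TextMate. The function name `section_end` is made up for illustration; it only restates the rule above (the opener's leading whitespace plus at least one more whitespace character) and is not how the grammar engine works.

```python
def section_end(lines, start):
    """Return the index one past the last line of the section opened at lines[start]."""
    opener = lines[start]
    # Leading whitespace of the line that opens the section.
    indent = opener[: len(opener) - len(opener.lstrip())]
    end = start + 1
    while end < len(lines):
        line = lines[end]
        # A member line repeats the opener's indentation and then has
        # at least one more whitespace character.
        next_char = line[len(indent) : len(indent) + 1]
        if not (line.startswith(indent) and next_char.isspace()):
            break
        end += 1
    return end


lines = [
    "  foo",
    "    bar",
    "      baz",
    "      baz",
    "      baz",
    "    bar",
    "      baz",
    "      baz",
    "  biz",
    "  foo",
]

print(section_end(lines, 0))  # 8: the first foo section stops before "  biz"
print(section_end(lines, 1))  # 5: the first bar section stops before the second bar
```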

Intuitively, I would expect the while rules in bar to apply, but based on my testing and what I think I found in the source code (admittedly I only spent about an hour producing this example and looking at the source code), it seems foo's while applies first and "eats" the line without handing it over to bar for its own continuation. Indeed, the way the rule stack is built and applied in src/grammar/tokenizeString.ts starting on line 343 looks to be designed to "reverse" the stack and apply each level fully before moving on (based both on the code and on the comment above the function at line 331).

If supporting nested begin/while rules is desired, then depending on Microsoft's prioritization of it and however external open source contributions are handled, I am willing to take a look at the issue myself, as I am currently blocked by it.

jeff-hykin commented 1 year ago

I agree. I believe what's happening here is that the while is eating all the characters on the line (see point 3 in the list below). (I'm just a random grammar/syntax maker, BTW.) That said, I think it is still possible to nest them.

Here's a copy-paste of my personal documentation on while, which I think partly confirms what you saw in the source code.

The textmate "while" key has almost no documentation. I'm writing this to explain what little I know about it.

The good part

The "while" key is stronger than the "end" pattern: as soon as the while stops matching, the range ends and, most importantly, any ranges that are still open inside it are cut off. This is incredibly important because almost nothing else in TextMate does this, and it is useful for containing broken syntax.

I believe it was designed to match things like Python's indentation-based blocks.

The bad part(s)

However, there are some caveats.

  1. The "while" pattern is line-based, not character-based. If you match a single character on a line, then the whole line is considered to be inside the pattern-range
  2. On each line, nothing will start being matched until the while pattern has been fully matched
  3. Once the while pattern matches, everything after the while pattern will be tagged using the patterns inside of the pattern-range (the included patterns)

(end)

I have no idea if any of that is intended behavior. That said, I think your issue is that the first while consumes all the text on the line, so there's nothing left for the included patterns to even match against.

I would guess that changing it to match just the whitespace of one indent level, and then including bar as a regular include (rather than a whileCapture), would fix it. I don't think I've ever gotten the while captures to work. I usually just use a lookahead in the while, and then include a pattern that does the actual matching.
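
As a rough illustration of the "eating" effect, here is a simplified Python model, not vscode-textmate's actual code. It assumes that the while patterns of open rules are checked outermost first and that each check resumes where the previous match ended. The function name `check_while_conditions` and the hard-coded patterns (with `\1` already replaced by the captured indentation) are made up for the sketch. It also assumes, for the narrow variant, that bar's begin only captured the two spaces left over after foo's while consumed its share.

```python
import re


def check_while_conditions(stack, line):
    """Return the names of the rules whose `while` still holds for `line`.

    `stack` lists (name, while_regex) pairs, outermost rule first.
    Each search starts where the previous match ended, so an outer rule
    that matches the whole line leaves nothing for the inner rules.
    """
    pos = 0
    surviving = []
    for name, while_regex in stack:
        match = re.compile(while_regex).search(line, pos)
        if match is None:
            break  # this rule and everything nested inside it is closed
        surviving.append(name)
        pos = max(pos, match.end())
    return surviving


line = "      baz"

# Original approach: each `while` captures the rest of the line.
greedy = [("foo", r"^(  \s+.*)"), ("bar", r"^(    \s+.*)")]
# Suggested approach: each `while` matches one indent level plus a lookahead.
narrow = [("foo", r"  (?=\s)"), ("bar", r"  (?=\s)")]

print(check_while_conditions(greedy, line))  # ['foo']
print(check_while_conditions(narrow, line))  # ['foo', 'bar']
```

In the greedy case, foo's match ends at the end of the line, and bar's `^`-anchored pattern cannot match from there, so bar is dropped. Whether bar even stays open after being matched inside whileCaptures is not modelled here.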

dawsonc623 commented 1 year ago

Thanks for the response. Interestingly, Python does not use while for this case (at least not in the built-in support). Only a handful of the built-in grammars use while at all, and Markdown is the only one that uses it extensively (apart from searchResult, which I assume is, well, related to the search feature).

Anyway, I went with capturing the full line and passing it into the whileCaptures because, from what I have seen in other constructs (say, match/captures), the includes can process the captured text as they need. That said, some of your thoughts triggered a different line of thinking, and I was able to adjust the test grammar so that it matches the test input as I would expect. I will attempt to port that over to my real grammar and report back if it works.

dawsonc623 commented 1 year ago

So, it worked on my full grammar (well, mostly, but I think where it is acting up is unrelated to this). I extended the test grammar and input to cover more nesting and cyclical cases, too, which worked.

The fix was indeed to step back from the whileCaptures approach and instead have while match just the captured level of indentation from the begin and a look-ahead to see if the next character was another whitespace. Then, putting patterns at the top level worked. The extended example:

Grammar:

{
  "name": "Foo",
  "scopeName": "source.foo",
  "patterns": [
    { "include": "#foo" }
  ],
  "repository": {
    "foo": {
      "name": "meta.foo",
      "begin": "(\\s*)foo.*",
      "while": "\\1(?=\\s)",
      "patterns": [{ "include": "#bar" }, { "include": "#hmm" }]
    },
    "bar": {
      "name": "meta.bar",
      "begin": "(\\s*)bar.*",
      "while": "\\1(?=\\s)",
      "patterns": [{ "include": "#baz" }]
    },
    "baz": {
      "name": "meta.baz",
      "begin": "(\\s*)baz.*",
      "while": "\\1(?=\\s)",
      "patterns": [{ "include": "#foo" }, { "include": "#hmm" }]
    },
    "hmm": {
      "name": "meta.hmm",
      "begin": "(\\s*)hmm.*",
      "while": "\\1(?=\\s)"
    }
  }
}

Input:

  foo
    bar
      baz
        foo
          bar
            baz
              hmm
      baz
      baz
        hmm
        hmm
        hmm
    bar
      baz
      baz
  biz
  foo
    hmm

The token inspector confirmed this works as I would expect, so regardless of what is going on with my full grammar I think the original issue was more in terms of my understanding of whileCaptures (or apparently lack thereof) than the overall nested concept. Because of that, I am closing this issue under the assumption the wonkiness in my full grammar is not quite this issue either.
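
For anyone who wants to sanity-check the extended example without an editor, here is a rough Python simulation of it. It is a toy model under stated assumptions, not the vscode-textmate algorithm: while patterns are re-checked outermost first, each begin captures only the whitespace left after the outer whiles have consumed theirs, and the trailing newline the real tokenizer sees is ignored. The names `INCLUDES`, `scopes_per_line`, and the `"root"` key are made up for the sketch.

```python
import re

# Which rules each rule includes, copied from the extended grammar.
# "root" stands in for the grammar's top-level "patterns".
INCLUDES = {
    "root": ["foo"],
    "foo": ["bar", "hmm"],
    "bar": ["baz"],
    "baz": ["foo", "hmm"],
    "hmm": [],
}


def scopes_per_line(lines):
    """Return, for each line, the names of the rules open after processing it."""
    stack = []  # (rule name, whitespace captured by its begin), outermost first
    result = []
    for line in lines:
        pos = 0
        kept = []
        # Step 1: re-check every open rule's `while` (\1(?=\s)), outermost first.
        for name, indent in stack:
            match = re.compile(re.escape(indent) + r"(?=\s)").search(line, pos)
            if match is None:
                break  # closes this rule and everything nested in it
            kept.append((name, indent))
            pos = match.end()
        stack = kept
        # Step 2: try the `begin` of each rule included by the innermost open rule.
        parent = stack[-1][0] if stack else "root"
        for name in INCLUDES[parent]:
            match = re.compile(r"(\s*)" + name + r".*").search(line, pos)
            if match is not None:
                stack.append((name, match.group(1)))
                break
        result.append([name for name, _ in stack])
    return result


EXTENDED_INPUT = [
    "  foo",
    "    bar",
    "      baz",
    "        foo",
    "          bar",
    "            baz",
    "              hmm",
    "      baz",
    "      baz",
    "        hmm",
    "        hmm",
    "        hmm",
    "    bar",
    "      baz",
    "      baz",
    "  biz",
    "  foo",
    "    hmm",
]

for line, scopes in zip(EXTENDED_INPUT, scopes_per_line(EXTENDED_INPUT)):
    print(f"{line:<20}{' > '.join(scopes)}")
```

In this model the biz line ends up with no open rules, and the final hmm line is nested directly under the last foo, which matches the behavior described above.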