microsoft / vscode-textmate

A library that helps tokenize text using TextMate grammars.
MIT License

multiply applied capture groups seem to ignore some captures #127

Open asottile opened 4 years ago

asottile commented 4 years ago

A bit of an edge case, and I'm not sure how it is supposed to be handled. I don't have a concrete use case; I'm just trying to implement my own parser in Python, using this library as a reference.

sample grammar

{
    "scopeName": "test",
    "patterns": [
        {
            "match": "((a)) ((b) c) (d (e)) ((f) )",
            "name": "matched",
            "captures": {
                "1": {"name": "g1"},
                "2": {"name": "g2"},
                "3": {"name": "g3"},
                "4": {"name": "g4"},
                "5": {"name": "g5"},
                "6": {"name": "g6"},
                "7": {
                    "patterns": [
                        {"match": "f", "name": "g7f"},
                        {"match": " ", "name": "g7space"}
                    ]
                },
                "8": {"name": "g8"}
            }
        }
    ]
}

sample file

a b c d e f z
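
To see which capture groups overlap on the sample line, the spans can be checked with a short script. This is only a sketch: it assumes Python's stdlib `re` module behaves like the Oniguruma engine for this simple pattern, and the variable names are illustrative rather than taken from either implementation.

```python
import re

# Assumption: stdlib `re` stands in for Oniguruma; for this pattern
# (literals and plain capture groups only) the group numbering is the same.
pattern = re.compile(r'((a)) ((b) c) (d (e)) ((f) )')
line = 'a b c d e f z'

m = pattern.search(line)
assert m is not None

# Map each capture group number to its (start, end) span in the line.
spans = {n: m.span(n) for n in range(1, pattern.groups + 1)}

for n, (start, end) in spans.items():
    print(f'group {n}: {start}-{end} {line[start:end]!r}')
```

Group 8 (the `f` alone) lies entirely inside group 7 (the `f` plus the trailing space), so the `f` is covered by both captures.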

tokenization using VS Code

$ node vsc.js cap.json f

Tokenizing line: a b c d e f z
 - token from 0 to 1 (a) with scopes test, matched, g1, g2
 - token from 1 to 2 ( ) with scopes test, matched
 - token from 2 to 3 (b) with scopes test, matched, g3, g4
 - token from 3 to 5 ( c) with scopes test, matched, g3
 - token from 5 to 6 ( ) with scopes test, matched
 - token from 6 to 8 (d ) with scopes test, matched, g5
 - token from 8 to 9 (e) with scopes test, matched, g5, g6
 - token from 9 to 10 ( ) with scopes test, matched
 - token from 10 to 11 (f) with scopes test, matched, g7f
 - token from 11 to 12 ( ) with scopes test, matched, g7space
 - token from 12 to 14 (z) with scopes test

I expect the f to have the scopes test, matched, g7f, g8, since capture group 8 is nested inside capture group 7:

>>> # ...
>>> state, regions = highlight_line(compiler, state, 'a b c d e f z', first_line=True)
>>> import pprint
>>> pprint.pprint(regions)
(Region(start=0, end=1, scope=('test', 'matched', 'g1', 'g2')),
 Region(start=1, end=2, scope=('test', 'matched')),
 Region(start=2, end=3, scope=('test', 'matched', 'g3', 'g4')),
 Region(start=3, end=5, scope=('test', 'matched', 'g3')),
 Region(start=5, end=6, scope=('test', 'matched')),
 Region(start=6, end=8, scope=('test', 'matched', 'g5')),
 Region(start=8, end=9, scope=('test', 'matched', 'g5', 'g6')),
 Region(start=9, end=10, scope=('test', 'matched')),
 Region(start=10, end=11, scope=('test', 'matched', 'g7f', 'g8')),
 Region(start=11, end=12, scope=('test', 'matched', 'g7space')),
 Region(start=12, end=13, scope=('test',)))
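
One way to arrive at the regions above is to build a scope stack per character and then merge equal neighbours. The following is a deliberately simplified sketch, not how vscode-textmate, TextMate, or the parser mentioned in this issue actually work: it assumes stdlib `re` in place of Oniguruma, hard-codes the sample grammar, and the names `CAPTURES`, `scopes_per_char`, and `regions` are hypothetical. It also assumes the sub-patterns of a capture do not overlap each other.

```python
import re
from itertools import groupby

LINE = 'a b c d e f z'
MATCH = re.compile(r'((a)) ((b) c) (d (e)) ((f) )')

# Capture number -> scope name, or a list of (regex, name) sub-patterns.
CAPTURES = {
    1: 'g1', 2: 'g2', 3: 'g3', 4: 'g4', 5: 'g5', 6: 'g6',
    7: [(re.compile('f'), 'g7f'), (re.compile(' '), 'g7space')],
    8: 'g8',
}


def scopes_per_char(line):
    """Return one scope tuple per character of `line`."""
    stacks = [('test',) for _ in line]
    m = MATCH.search(line)
    if m is None:
        return stacks
    for i in range(*m.span()):
        stacks[i] += ('matched',)
    # Apply captures in numeric order so outer groups come before inner ones.
    for num in sorted(CAPTURES):
        start, end = m.span(num)
        rule = CAPTURES[num]
        if isinstance(rule, str):
            for i in range(start, end):
                stacks[i] += (rule,)
        else:
            # Run the capture's own patterns, restricted to the capture span.
            for sub_pattern, name in rule:
                for sub in sub_pattern.finditer(line, start, end):
                    for i in range(*sub.span()):
                        stacks[i] += (name,)
    return stacks


def regions(line):
    """Merge adjacent characters with identical scopes into (start, end, scope)."""
    out = []
    pos = 0
    for scope, run in groupby(scopes_per_char(line)):
        length = len(list(run))
        out.append((pos, pos + length, scope))
        pos += length
    return out


for region in regions(LINE):
    print(region)
```

Because capture 8 is applied after capture 7's sub-patterns, the `f` ends up with both g7f and g8, which is the behaviour expected in this issue.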

alexdima commented 4 years ago

I also tried this in TextMate, and it appears to handle it the way you expect:

[image]

Here is the grammar converted to TextMate's format:

{   patterns = (
        {   
            match = "((a)) ((b) c) (d (e)) ((f) )";
            name = "matched";
            captures = {
                1 = { name = "g1"; };
                2 = { name = "g2"; };
                3 = { name = "g3"; };
                4 = { name = "g4"; };
                5 = { name = "g5"; };
                6 = { name = "g6"; };
                7 = {
                    patterns = (
                        { match = "f"; name = "g7f"; },
                        { match = " "; name = "g7space"; },
                    );
                };
                8 = { name = "g8"; };
            };
        },
    );
}

RedCMD commented 1 month ago

dup: https://github.com/microsoft/vscode-textmate/issues/164 https://github.com/microsoft/vscode-textmate/issues/208

asottile commented 1 month ago

@RedCMD usually the dupe goes the other way, since this one is older and has more context