microsoft / ts-parsec

Writing a custom parser is a fairly common need. Although there are already parser combinators in other languages, TypeScript provides a powerful and well-structured foundation for building one. Common weaknesses of parser combinators are error handling and ambiguity resolution, but these are important features of ts-parsec. Additionally, ts-parsec provides a very easy-to-use programming interface that could help people build programming-language-scale parsers in just a few hours. This technology has already been used in Microsoft/react-native-tscodegen.

Capturing first capture group in Lexer #20

Closed brianwilliams-candide closed 1 year ago

brianwilliams-candide commented 4 years ago

The code below is untested, but before starting a PR I thought it best to start a discussion:

Foreword

In the code example below I have added a const to store the match of the current substring instead of just testing it for a match. The reason for this is that it allows us to specify a capture group within the lexer's token definition.

Rationale

I am not sure if it's the responsibility of the lexer/tokeniser, but there seems to be an issue with collisions. Take...

[true, /^[a-z]+/g, TokenKind.FieldName],
[true, /^[a-zA-Z\s]+/g, TokenKind.FieldLabel],

... for the string "{name:label}".

We tried to implement a parser that had three tokens: one each for the FieldName and FieldLabel, and another for the colon (LabelSeparator).

Unless it's just naivety on our part, that didn't work: before it even gets to the parsing stage, the lexer has already fallen over because of a regex collision. Unless I specifically make the label uppercase or include a space, it matches both the FieldName and FieldLabel rules, and the lexer picks the first one specified.
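The collision described above can be reproduced with a small stand-alone sketch. It does not use the ts-parsec lexer itself; `tokenize`, the `Brace` kind, and the `LabelSeparator` rule are hypothetical stand-ins, and the only behaviour assumed is the one visible in the code example further down: the longest match wins, and on a tie the earlier rule is kept.

```typescript
// Hypothetical token kinds mirroring the rules quoted above.
enum TokenKind { FieldName, FieldLabel, LabelSeparator, Brace }

type Rule = [RegExp, TokenKind];

const rules: Rule[] = [
    [/^[a-z]+/g, TokenKind.FieldName],
    [/^[a-zA-Z\s]+/g, TokenKind.FieldLabel],
    [/^:/g, TokenKind.LabelSeparator],
    [/^[{}]/g, TokenKind.Brace],
];

// Assumed strategy: longest match wins; on a tie the earlier rule is kept.
function tokenize(input: string): Array<{ kind: TokenKind; text: string }> {
    const tokens: Array<{ kind: TokenKind; text: string }> = [];
    let rest = input;
    while (rest.length > 0) {
        let best: { kind: TokenKind; text: string } | undefined;
        for (const [regexp, kind] of rules) {
            regexp.lastIndex = 0;
            const match = regexp.exec(rest);
            if (match && (best === undefined || best.text.length < match[0].length)) {
                best = { kind, text: match[0] };
            }
        }
        if (best === undefined) {
            throw new Error(`Unable to tokenize: ${rest}`);
        }
        tokens.push(best);
        rest = rest.substring(best.text.length);
    }
    return tokens;
}

// Both "name" and "label" come out as FieldName, because the two rules
// match the same length and FieldName is listed first.
const kinds = tokenize("{name:label}").map(t => TokenKind[t.kind]);
```

A label such as "First Name" would still be lexed as FieldLabel, since only the second rule can match the uppercase letters and the space.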

That forced us to specify the FieldLabel with a colon prefix, i.e. [true, /^:[a-zA-Z\s]+/g, TokenKind.FieldLabel],

What that now means is that we have to manually strip the colon off at the parsing stage. I was wondering if adding support for the capture group syntax would mitigate this problem.

[true, /^:([a-zA-Z\s]+)/g, TokenKind.FieldLabel],

Using this regex and (something similar to) the code below, it would match on the whole regex but only capture the part we want (if specified).
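To make the distinction between "matched" and "captured" concrete, here is a minimal sketch using only the standard `RegExp.prototype.exec`. The helper `matchRule` and the `RuleMatch` shape are hypothetical names for illustration, not part of ts-parsec.

```typescript
interface RuleMatch {
    consumed: string; // everything the regex matched (advances the input)
    captured: string; // first capture group if present, else the whole match
}

function matchRule(regexp: RegExp, input: string): RuleMatch | undefined {
    regexp.lastIndex = 0;
    const match = regexp.exec(input);
    if (match === null) {
        return undefined;
    }
    const captured = match[1] !== undefined ? match[1] : match[0];
    return { consumed: match[0], captured };
}

// Rule with a capture group: the colon is consumed but not captured.
const withGroup = matchRule(/^:([a-zA-Z\s]+)/g, ":label}");

// Rule without a capture group: consumed and captured are the same.
const withoutGroup = matchRule(/^[a-z]+/g, "name:label}");
```

Here `withGroup` consumes ":label" but captures "label", while `withoutGroup` consumes and captures "name". The lexer would still need the consumed length to advance its position and to compute row and column information.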

If this is over-engineering of a problem that doesn't exist (which I have a suspicion it might be) by all means please let me know of the appropriate solution.

Code Example

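for (const [keep, regexp, kind] of this.rules) {
    regexp.lastIndex = 0;
    // Store the match (instead of only testing) so capture groups are available.
    const match = regexp.exec(subString);
    if (match) {
        // Row/column tracking still uses everything the regex consumed.
        const text = subString.substr(0, regexp.lastIndex);
        let rowEnd = rowBegin;
        let columnEnd = columnBegin;
        for (const c of text) {
            switch (c) {
                case '\r': break;
                case '\n': rowEnd++; columnEnd = 1; break;
                default: columnEnd++;
            }
        }

        // Use the first capture group as the token text when the rule defines one;
        // otherwise fall back to the whole match.
        const tokenText = match[1] !== undefined ? match[1] : match[0];
        const newResult = new TokenImpl<T>(this, input, kind, tokenText, { index: indexStart, rowBegin, columnBegin, rowEnd, columnEnd }, keep);
        // Note: this compares captured text, which may be shorter than the consumed text.
        if (result === undefined || result.text.length < newResult.text.length) {
            result = newResult;
        }
    }
}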

ZihanChen-MSFT commented 3 years ago

Sorry to be responding so late. This project is not as actively maintained as others, and I am the only one working on it.

I don't know if I fully understand the issue, but for me it is about how to define the syntax. I believe you could simply write

"{" NAME ":" LABEL "}"

as

"{" NAME ":" (NAME | LABEL ) "}"

and it should just work.

Since the LABEL token is a superset of the NAME token, by writing the syntax in this way you don't need to worry about the lexer producing the token stream "{", NAME, ":", NAME, "}" and making the parser fail.
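As a rough illustration of that suggestion, the sketch below checks a token stream against "{" NAME ":" (NAME | LABEL) "}" by hand. It deliberately avoids the ts-parsec API; `Kind`, `Token`, and `parseField` are hypothetical names, and with ts-parsec the same alternative would be expressed through its combinators instead.

```typescript
type Kind = "LBrace" | "RBrace" | "Colon" | "Name" | "Label";

interface Token {
    kind: Kind;
    text: string;
}

// Accepts "{" NAME ":" (NAME | LABEL) "}" and returns the two field parts.
function parseField(tokens: Token[]): { name: string; label: string } | undefined {
    if (tokens.length !== 5) {
        return undefined;
    }
    const [open, name, colon, label, close] = tokens;
    const ok =
        open.kind === "LBrace" &&
        name.kind === "Name" &&
        colon.kind === "Colon" &&
        // The label position tolerates either token kind.
        (label.kind === "Name" || label.kind === "Label") &&
        close.kind === "RBrace";
    return ok ? { name: name.text, label: label.text } : undefined;
}

// The lexer produced NAME in the label position; the grammar still accepts it.
const lowerCase = parseField([
    { kind: "LBrace", text: "{" },
    { kind: "Name", text: "name" },
    { kind: "Colon", text: ":" },
    { kind: "Name", text: "label" },
    { kind: "RBrace", text: "}" },
]);

// A label with uppercase letters and a space arrives as LABEL.
const mixedCase = parseField([
    { kind: "LBrace", text: "{" },
    { kind: "Name", text: "name" },
    { kind: "Colon", text: ":" },
    { kind: "Label", text: "First Name" },
    { kind: "RBrace", text: "}" },
]);
```

With this approach the separator stays its own token, so nothing has to be stripped from the label text afterwards.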