no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Support for `pop` values higher than 1 #177

Closed zharinov closed 2 years ago

zharinov commented 2 years ago

First of all, I'm happy to finally get the chance to say how much I appreciate this library, which helps us a lot with our parsers for Renovate.

The problem we've encountered is how to easily parse different styles of string template literals:

This PR implements support for pop values higher than 1 which seems to be enough to solve problems like this.

nathan commented 2 years ago

You don't need pop > 1 to tokenize the example you gave:

const moo = require('moo')

const lex = moo.states({
    main: {
        complex: {match: '${', push: 'interp'},
        simple: {match: '$', push: 'simple'},
        lit: {match: /[^$]+/u, lineBreaks: true},
    },
    simple: {
        simpleStuff: {match: /\w+/, pop: true},
    },
    interp: {
        complexClose: {match: '}', pop: true},
        complexStuff: {match: /[^}]+/u, lineBreaks: true},
    },
})

console.log([...lex.reset('com.fasterxml.jackson.core:jackson-annotations:$version')])
console.log([...lex.reset('com.fasterxml.jackson.core:jackson-annotations:${version}')])

Could you give a real-world example of something that can't be tokenized with the current version of moo? (My original states implementation almost supported pop > 1, but I couldn't think of any uses for it that weren't just complex/dubious versions of pop: 1 lexers.)

EDIT: I looked at the test case in the PR. You can do that with just next and pop: 1. (In your code the tpl state nexts to itself, which is a no-op.)

const moo = require('moo')

const lex = moo.states({
  main: {
    strstart: {match: '"', push: 'str'},
    ident:    /\w+/,
    space:    {match: /\s+/, lineBreaks: true},
  },
  str: {
    strend:   {match: '"', pop: true},
    tplstart: {match: '$', next: 'tpl'},
    content:  moo.fallback,
  },
  tpl: {
    strend:   {match: '"', pop: true},
    tplstart: '$',
    ident:    /\w+/,
    content:  moo.fallback,
  },
})
console.log(Array.from(lex.reset('"$foo $bar" baz'), x => x.type))
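
The difference between push and next in the lexer above can be pictured with a toy state stack. This is a minimal sketch, not moo's implementation: `simulateStates` and the transition objects are hypothetical names used only for illustration. It assumes that push saves the current state before switching, next switches without saving, and pop restores the most recently saved state.

```javascript
// Toy model of a lexer state stack (an illustration, not moo's source code).
// Each transition is a hypothetical object with one of: push, next, pop.
function simulateStates(initialState, transitions) {
  const stack = []          // states saved by push, most recent last
  let state = initialState  // the state currently in effect
  const states = []         // state in effect after each transition
  for (const t of transitions) {
    if (t.pop) {
      state = stack.pop()   // return to the state that did the push
    } else if (t.push) {
      stack.push(state)     // remember where to come back to
      state = t.push
    } else if (t.next) {
      state = t.next        // switch state; stack depth is unchanged
    }
    states.push(state)
  }
  return {states, depth: stack.length}
}

// Transitions taken while lexing "$foo $bar": the opening quote pushes str,
// the first $ moves to tpl via next, and the closing quote pops.
const result = simulateStates('main', [
  {push: 'str'},
  {next: 'tpl'},
  {pop: true},
])
console.log(result.states.join(' > '), result.depth) // prints "str > tpl > main 0"
```

Because next replaces the current state instead of stacking a new one, only one state is saved when the closing quote arrives, so a single pop is enough to get back to main.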

zharinov commented 2 years ago

Well, my edge case is quite specific: I need to handle $foo.bar and ${foo.bar} in the same way, as strstart tplstart sym dot sym tplend strend. I did my best to keep both variations as close as possible, but there may be undesired side effects here and there.

I'm now thinking of constructing a single regex-based token type for the "simple" variation and post-processing its inner value with a simpler parser. It requires more code, but it will be more precise.
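
A rough sketch of that idea, assuming the "simple" variation is matched as one token whose text looks like `$foo.bar`. The helper name `expandSimpleTemplate`, the regex, and the shape of the returned objects are assumptions made for illustration; they are not part of moo or of the PR.

```javascript
// Hypothetical helper: split the text of a single "$foo.bar" token into the
// inner tokens tplstart, sym, dot, sym, ..., tplend.
function expandSimpleTemplate(text) {
  // Assumed shape of the simple variation: "$" followed by dotted identifiers.
  const m = /^\$(\w+(?:\.\w+)*)$/.exec(text)
  if (!m) return null // not a simple template; the caller decides what to do

  const out = [{type: 'tplstart', value: '$'}]
  m[1].split('.').forEach((name, i) => {
    if (i > 0) out.push({type: 'dot', value: '.'})
    out.push({type: 'sym', value: name})
  })
  // tplend has no characters in the input, so its value is empty.
  out.push({type: 'tplend', value: ''})
  return out
}

const inner = expandSimpleTemplate('$foo.bar')
console.log(inner.map(t => t.type).join(' ')) // prints "tplstart sym dot sym tplend"
```

The inner objects here carry only type and value; a real version would probably also need to compute offsets and line/column information from the outer token.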

Sorry for distracting you. I'll close this PR if you don't mind.

zharinov commented 2 years ago

And thank you for the quick response 😉

nathan commented 2 years ago

No worries! If you find yourself needing tokens that don't represent any characters in the input (like tplend in the un-braced example), it's a good sign you should post-process the token stream. The easiest way to do that is to use more specific token names like tplstartunbraced, then detect each tplstartunbraced, rename it to tplstart and insert a tplend after the last sym that follows it. (See, e.g., this post-processing example for whitespace sensitivity.) Matching too much in a single token and re-lexing is usually slower.
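
The renaming step described above could look roughly like the following. This is a sketch under stated assumptions: tokens are plain objects with a `type` field (as moo tokens are), the lexer emits `tplstartunbraced`, `sym` and `dot` token types, and the function name `closeUnbracedTemplates` plus the shape of the synthetic `tplend` token are made up for illustration.

```javascript
// Post-process a token array: rename each tplstartunbraced to tplstart and
// insert a synthetic tplend after the last sym that follows it.
function closeUnbracedTemplates(tokens) {
  const out = []
  let i = 0
  while (i < tokens.length) {
    const tok = tokens[i]
    if (tok.type !== 'tplstartunbraced') {
      out.push(tok)
      i++
      continue
    }
    out.push({...tok, type: 'tplstart'})
    i++
    // Scan the run of sym/dot tokens and remember where the last sym ends.
    let end = i
    for (let j = i; j < tokens.length; j++) {
      const type = tokens[j].type
      if (type !== 'sym' && type !== 'dot') break
      if (type === 'sym') end = j + 1
    }
    while (i < end) out.push(tokens[i++])
    // Synthetic token: it does not correspond to any input characters.
    out.push({type: 'tplend', value: '', text: ''})
  }
  return out
}

// Hand-written tokens standing in for the lexer output of "$foo.bar".
const tokens = ['strstart', 'tplstartunbraced', 'sym', 'dot', 'sym', 'strend']
  .map(type => ({type}))
console.log(closeUnbracedTemplates(tokens).map(t => t.type).join(' '))
// prints "strstart tplstart sym dot sym tplend strend"
```

A trailing dot that is not followed by a sym stays outside the template, and a tplstartunbraced with no sym after it gets a tplend immediately; whether that second case should be an error instead is a design choice left open here.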