Closed by zharinov 2 years ago
You don't need `pop > 1` to tokenize the example you gave:
    const lex = moo.states({
      main: {
        // `${` must be listed before `$` so the braced form is tried first.
        complex: {match: '${', push: 'interp'},
        simple: {match: '$', push: 'simple'},
        lit: {match: /[^$]+/u, lineBreaks: true},
      },
      simple: {
        // A single identifier ends the un-braced template.
        simpleStuff: {match: /\w+/, pop: true},
      },
      interp: {
        complexClose: {match: '}', pop: true},
        complexStuff: {match: /[^}]+/u, lineBreaks: true},
      },
    })

    console.log([...lex.reset('com.fasterxml.jackson.core:jackson-annotations:$version')])
    console.log([...lex.reset('com.fasterxml.jackson.core:jackson-annotations:${version}')])

Could you give a real-world example of something that can't be tokenized with the current version of moo? (My original states implementation almost supported `pop > 1`, but I couldn't think of any uses for it that weren't just complex/dubious versions of `pop: 1` lexers.)
EDIT: I looked at the test case in the PR. You can do that with just `next` and `pop: 1`. (In your code the `tpl` state `next`s to itself, which is a no-op.)
    const lex = moo.states({
      main: {
        strstart: {match: '"', push: 'str'},
        ident: /\w+/,
        space: {match: /\s+/, lineBreaks: true},
      },
      str: {
        strend: {match: '"', pop: true},
        // `next` switches state without growing the stack,
        // so the single `pop` in `tpl` returns straight to `main`.
        tplstart: {match: '$', next: 'tpl'},
        content: moo.fallback,
      },
      tpl: {
        strend: {match: '"', pop: true},
        tplstart: '$',
        ident: /\w+/,
        content: moo.fallback,
      },
    })

    console.log(Array.from(lex.reset('"$foo $bar" baz'), x => x.type))

Well, my edge case is quite specific, as I need to handle `$foo.bar` and `${foo.bar}` in the same way: `strstart tplstart sym dot sym tplend strend`. I did my best to keep both variations as close as possible, but there may be undesired side effects here and there.
Now I'm leaning towards constructing a single regex-based token type for the "simple" variation and post-processing its inner value with a simpler parser. It requires more code, but will be more precise.
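A minimal sketch of what that post-processing step might look like, assuming the lexer emits the whole `$foo.bar` as one token value. The function name `expandSimpleTemplate` and the pattern `SIMPLE_TEMPLATE` are hypothetical and not part of moo; only the token type names come from this thread.

```javascript
// Hypothetical pattern for a whole un-braced template such as `$foo.bar`.
const SIMPLE_TEMPLATE = /^\$(\w+(?:\.\w+)*)$/

// Expand the value of a single "simple template" token into the same
// token sequence the braced form would produce.
function expandSimpleTemplate(value) {
  const m = SIMPLE_TEMPLATE.exec(value)
  if (!m) throw new Error(`not a simple template: ${value}`)

  const tokens = [{type: 'tplstart', value: '$'}]
  m[1].split('.').forEach((name, i) => {
    // Every segment after the first is preceded by a `dot` token.
    if (i > 0) tokens.push({type: 'dot', value: '.'})
    tokens.push({type: 'sym', value: name})
  })
  // `tplend` has no characters in the input, so its value is empty.
  tokens.push({type: 'tplend', value: ''})
  return tokens
}
```

Calling `expandSimpleTemplate('$foo.bar')` yields tokens of type `tplstart sym dot sym tplend`, which would then be spliced into the stream in place of the original token.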
Sorry for distracting you, I'll close this PR if you don't mind.
And thank you for the quick response 😉
No worries! If you find yourself needing tokens that don't represent any characters in the input (like `tplend` in the un-braced example), it's a good sign you should post-process the token stream. The easiest way to do that is to use more specific token names like `tplstartunbraced`, then detect each `tplstartunbraced`, rename it to `tplstart`, and insert a `tplend` after the last `sym` that follows it. (See, e.g., this post-processing example for whitespace sensitivity.) Matching too much in a single token and re-lexing is usually slower.
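The renaming step described above could be sketched roughly as follows. This is only an illustration: `closeUnbracedTemplates` is a hypothetical helper, not part of moo, and it assumes the tokens are plain objects with a `type` field and that an un-braced template consists only of `sym` and `dot` tokens.

```javascript
// Hypothetical helper: takes an array of {type, value} tokens and returns a
// new array where each `tplstartunbraced` becomes `tplstart` and a synthetic
// `tplend` is inserted after the last `sym` that follows it.
function closeUnbracedTemplates(tokens) {
  const out = []
  // Index in `out` just after the last `sym` of the open template,
  // or -1 when no un-braced template is open.
  let endAt = -1

  const close = () => {
    out.splice(endAt, 0, {type: 'tplend', value: ''})
    endAt = -1
  }

  for (const tok of tokens) {
    const continues = tok.type === 'sym' || tok.type === 'dot'
    // Any other token type ends the current un-braced template.
    if (endAt !== -1 && !continues) close()

    if (tok.type === 'tplstartunbraced') {
      out.push({...tok, type: 'tplstart'})
      endAt = out.length
    } else {
      out.push(tok)
      if (endAt !== -1 && tok.type === 'sym') endAt = out.length
    }
  }
  // Close a template that runs to the end of the input.
  if (endAt !== -1) close()
  return out
}
```

With input types `strstart tplstartunbraced sym dot sym strend`, the result has types `strstart tplstart sym dot sym tplend strend`, matching the braced form. A trailing `dot` with no `sym` after it is left outside the template, since `tplend` goes after the last `sym`.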
First of all, I'm happy to finally express my appreciation for this library, which helps us a lot with our parsers for Renovate.
The problem we've encountered is how to easily parse different styles of string template literals:
com.fasterxml.jackson.core:jackson-annotations:$version
com.fasterxml.jackson.core:jackson-annotations:${version}
This PR implements support for `pop` values higher than 1, which seems to be enough to solve problems like this.