no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Maximum munch? #159

Open molnarp opened 3 years ago

molnarp commented 3 years ago

Hi,

this is more of a question than an issue about Moo, so here goes:

I have the following lexer:

const lexer = moo.compile({
  TERM: /[a-z]+/,
  PREFIXTERM: /\*|(?:[a-z]+\*)/,
});

On input moo, this will return:

{"type":"TERM","value":"moo","text":"moo","offset":0,"lineBreaks":0,"line":1,"col":1}

On input moo* I would want it to return a single PREFIXTERM, but I'm getting this instead:

{"type":"TERM","value":"moo","text":"moo","offset":0,"lineBreaks":0,"line":1,"col":1}
{"type":"PREFIXTERM","value":"*","text":"*","offset":3,"lineBreaks":0,"line":1,"col":4}

How can I get it to go for a single PREFIXTERM?

tjvr commented 3 years ago

Have you tried swapping the order of the rules? Earlier rules take precedence.

molnarp commented 3 years ago

I can't really do that, because I also have:

WILDTERM: /(?:[a-z*?]+)/,

which matches a superset of what TERM matches. In this setup, if the input is mo*o, TERM consumes the prefix mo, then PREFIXTERM consumes the asterisk, and so on.

This would work if the longest match were picked. Instead, the earliest matching rule wins. I was wondering how to get around this issue.
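To make the desired behaviour concrete, here is a minimal sketch of a longest-match ("maximum munch") tokenizer in plain JavaScript. It does not use Moo and is not how Moo works: `longestMatch` and `tokenize` are hypothetical helpers written for illustration, and the rule table simply reuses the three regexps from this thread with the sticky flag added. Ties go to the earlier rule, which is an assumption about what is wanted here.

```javascript
// Hypothetical rule table: the regexps from the thread, with /y (sticky) added
// so that each one can only match exactly at lastIndex.
const rules = [
  ["TERM", /[a-z]+/y],
  ["PREFIXTERM", /\*|(?:[a-z]+\*)/y],
  ["WILDTERM", /(?:[a-z*?]+)/y],
];

// Try every rule at `offset` and keep the longest match.
function longestMatch(input, offset) {
  let best = null;
  for (const [type, re] of rules) {
    re.lastIndex = offset;
    const m = re.exec(input);
    // Only a strictly longer match replaces the current best,
    // so ties are resolved in favour of the earlier rule.
    if (m && (best === null || m[0].length > best.value.length)) {
      best = { type, value: m[0], offset };
    }
  }
  return best;
}

// Repeatedly take the longest match until the input is consumed.
function tokenize(input) {
  const tokens = [];
  let offset = 0;
  while (offset < input.length) {
    const tok = longestMatch(input, offset);
    if (tok === null) throw new Error(`No rule matches at offset ${offset}`);
    tokens.push(tok);
    offset += tok.value.length;
  }
  return tokens;
}

console.log(tokenize("moo").map((t) => t.type));  // [ 'TERM' ]
console.log(tokenize("moo*").map((t) => t.type)); // [ 'PREFIXTERM' ]
console.log(tokenize("mo*o").map((t) => t.type)); // [ 'WILDTERM' ]
```

The cost is that every rule is tried at every position, which is the per-rule work a single combined regexp avoids.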

tjvr commented 3 years ago

I'm afraid I don't exactly understand what you're trying to do.

Moo doesn't choose the regexp with the longest match -- indeed, because it combines all the regexps into a single JS regexp for speed, it can't do this. Instead, the first rule whose regexp matches at the current position wins: earlier rules take precedence.
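The effect of combining rules into one regexp can be reproduced with plain JavaScript. The sketch below is a simplified illustration, not Moo's actual implementation: it hand-writes the combined regexp with one capture group per rule, and `firstToken` is a hypothetical helper. JavaScript alternation tries the alternatives left to right and stops at the first one that succeeds, so the rule order decides the result.

```javascript
// One capture group per rule, joined with | in rule order.
const termFirst = /([a-z]+)|(\*|(?:[a-z]+\*))/y;
const prefixFirst = /(\*|(?:[a-z]+\*))|([a-z]+)/y;

// Match once at the start of `input` and report which group (rule) matched.
function firstToken(combined, names, input) {
  combined.lastIndex = 0;
  const m = combined.exec(input);
  if (m === null) return null;
  const group = m.slice(1).findIndex((g) => g !== undefined);
  return { type: names[group], value: m[0] };
}

console.log(firstToken(termFirst, ["TERM", "PREFIXTERM"], "moo*"));
// { type: 'TERM', value: 'moo' }
console.log(firstToken(prefixFirst, ["PREFIXTERM", "TERM"], "moo*"));
// { type: 'PREFIXTERM', value: 'moo*' }
console.log(firstToken(prefixFirst, ["PREFIXTERM", "TERM"], "moo"));
// { type: 'TERM', value: 'moo' }
```

With TERM first, the alternation is satisfied by moo and never considers the longer moo*. With PREFIXTERM first, moo* is matched whole, and plain moo still falls through to TERM.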

It's hard to provide a solid recommendation without knowing more about the language you're trying to parse. But usually people seem to solve problems that sound like this by: