no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

What's the best way to handle similar numeric tokens? #84

Closed: newtang closed this issue 6 years ago

newtang commented 6 years ago

If I want to parse something like a month and year (e.g. 10/2018 or 10/18), how can I make distinct tokens for month and year without the parser getting confused? (I'm using Moo with Nearley.)

@{%
const moo = require("moo");

const lexer = moo.compile({
  delim: /[\/\-]+/,
  month: /(?:[0-1][0-9])|[1-9]/,
  year: /[0-9]{4}|[0-9]{2}/
});
%}

@lexer lexer
expression -> %month %delim %year

I'll get an error:

Error: invalid syntax at line 1 col 4:

  10/2
     ^
Unexpected month token: "2"

If I reverse the order of month and year in the object I pass to moo.compile, I get this error instead:

Error: invalid syntax at line 1 col 1:

  10
  ^
Unexpected year token: "10"

What is the best strategy for handling something like this? I prefer keeping the distinct tokens so I can catch errors.

nathan commented 6 years ago

If you really need to do this in a lexer, you can use lookahead:

const lexer = moo.compile({
  delim: /[\/-]+/,
  month: /(?:[01]\d|[1-9])(?=[\/-])/,
  year: /\d{4,}|\d{2}/, // Years can be more than 4 digits long, by the way
})

But you should probably just match on (?:[01]\d|[1-9])[\/-](?:\d{4,}|\d{2}) and parse that further when you need to, or lex \d+ and [-/] and handle the higher-level syntax and validation in your parser.
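
As a rough illustration of the first suggestion (match the whole date, then pick it apart afterwards), here is a minimal sketch in plain JavaScript. It does not use moo or nearley; `MONTH_YEAR` and `parseMonthYear` are hypothetical names, and the capture groups and the 1 to 12 range check are assumptions added for the example.

```javascript
// Hypothetical helper for illustration; not part of moo or nearley.
// Same pattern as suggested above, anchored and with capture groups so the
// month and year can be pulled out after the match.
const MONTH_YEAR = /^([01]\d|[1-9])[\/-](\d{4,}|\d{2})$/;

function parseMonthYear(text) {
  const match = MONTH_YEAR.exec(text);
  if (match === null) {
    throw new Error(`not a month/year: ${JSON.stringify(text)}`);
  }
  const month = Number(match[1]);
  // The pattern also admits 00 and 13 to 19, so the range is checked here.
  if (month < 1 || month > 12) {
    throw new Error(`month out of range: ${match[1]}`);
  }
  // The year stays a string so "18" and "2018" remain distinguishable.
  return { month, year: match[2] };
}
```

With this shape, `parseMonthYear("10/2018")` and `parseMonthYear("10/18")` both succeed, while inputs such as `10/891` or `13/2018` are rejected with an error instead of being split into surprising tokens.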

(The example you gave, 10/2, doesn't fit your original description of mm/yyyy or mm/yy; I assume you mean 10/02.)

Edit: Note that the above lexer will do bizarre things if you give it input like 10/891/ (= 10 / 89 1 /). All the more reason to simplify your lexer to \d+ and [-/].
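
The simplified approach (lex only \d+ and [-/], validate afterwards) might look roughly like the sketch below. It emulates a tiny lexer with plain sticky regexes rather than calling moo, and the validation would normally live in nearley postprocessors; `TOKEN_RULES`, `tokenize`, and `parseMonthYearTokens` are hypothetical names, and the month and year checks are assumptions based on the mm/yy and mm/yyyy formats discussed above.

```javascript
// Hypothetical names throughout; a plain-RegExp sketch, not moo's API.
// The lexer only knows two token types; it makes no month/year decisions.
const TOKEN_RULES = [
  ["number", /\d+/y],
  ["delim", /[-\/]/y],
];

function tokenize(input) {
  const tokens = [];
  let offset = 0;
  while (offset < input.length) {
    let matched = false;
    for (const [type, regex] of TOKEN_RULES) {
      regex.lastIndex = offset; // sticky: match must start exactly here
      const m = regex.exec(input);
      if (m !== null) {
        tokens.push({ type, value: m[0], offset });
        offset += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error(`unexpected character at offset ${offset}`);
    }
  }
  return tokens;
}

// Parser-side validation: first check the token shape, then the values.
function parseMonthYearTokens(input) {
  const tokens = tokenize(input);
  const shape = tokens.map((t) => t.type).join(" ");
  if (shape !== "number delim number") {
    throw new Error(`expected month, delimiter, year; got: ${shape}`);
  }
  const [monthTok, , yearTok] = tokens;
  if (!/^(?:0?[1-9]|1[0-2])$/.test(monthTok.value)) {
    throw new Error(`invalid month: ${monthTok.value}`);
  }
  if (!/^(?:\d{2}|\d{4,})$/.test(yearTok.value)) {
    throw new Error(`invalid year: ${yearTok.value}`);
  }
  return { month: Number(monthTok.value), year: yearTok.value };
}
```

Here `10/891/` lexes as four unsurprising tokens (`10`, `/`, `891`, `/`) and is then rejected for having the wrong shape, so the error points at the real problem rather than at a stray month token.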

newtang commented 6 years ago

Got it, thanks @nathan!