Add error recovery mode?

no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

BSD 3-Clause "New" or "Revised" License


Add error recovery mode? #56

Closed deltaidea closed 7 years ago

deltaidea commented 7 years ago

In my language, lines with invalid stuff are considered comments. I know, insane, but I'd like to support that if possible. Currently, { error: true } is extremely greedy: it treats everything from the first error onwards as a single token.

I propose an optional error tolerance mode that can be enabled with { error: true, recover: true }:

if no token can be parsed at the current position:
- if not already in recovery mode, save the current position as the recovery starting point
- increment the position (skip the current character)
if a token can be parsed and a recovery starting point exists:
- return a token of the error token type, with the line and col of the recovery starting point
- delete the recovery starting point (so that the current valid token is returned next time)
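
The steps above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not moo's implementation or API: lexWithRecovery, its rules argument (token type name to regex source) and the offset field are hypothetical, it returns the whole token array instead of one token per call, and it records an offset rather than line and col.

```javascript
// Hypothetical standalone sketch of the proposed recovery mode; not moo code.
// `rules` maps token type names to regex sources (assumed never to match '').
function lexWithRecovery(input, rules, errorType = 'error') {
  const compiled = Object.entries(rules).map(
    ([type, source]) => [type, new RegExp(source, 'y')] // sticky: match only at lastIndex
  );
  const tokens = [];
  let pos = 0;
  let recoveryStart = null; // offset where the current run of invalid text began

  // Emit the pending error token, if any, covering recoveryStart..pos.
  const flushError = () => {
    if (recoveryStart !== null) {
      tokens.push({ type: errorType, value: input.slice(recoveryStart, pos), offset: recoveryStart });
      recoveryStart = null;
    }
  };

  while (pos < input.length) {
    let matched = null;
    for (const [type, re] of compiled) {
      re.lastIndex = pos;
      const m = re.exec(input);
      if (m && m[0].length > 0) {
        matched = { type, value: m[0], offset: pos };
        break;
      }
    }
    if (matched) {
      flushError(); // the error token comes first, then the valid token
      tokens.push(matched);
      pos += matched.value.length;
    } else {
      if (recoveryStart === null) recoveryStart = pos; // enter recovery mode
      pos += 1; // skip the current character
    }
  }
  flushError(); // invalid text running to the end of input
  return tokens;
}
```

For example, lexWithRecovery('ab $$ cd', { id: '[a-z]+', ws: '\\s+' }) yields id, ws, error, ws, id, with the error token holding '$$'.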

nathan commented 7 years ago

moo.compile({
  id: /\w+/,
  ws: {match: /\s+/, lineBreaks: true},
  // … rules rules rules …
  ignore: /.+/, // skip to eol
})

If you meant the entire line gets ignored, not just the trailing lexically invalid part, that's a job for the parser, not the lexer, because there are almost always sequences of lexically valid tokens that are not syntactic. (For example, + - is a sequence of JS tokens that is not syntactic.)

deltaidea commented 7 years ago

I tried that approach:

compile({
  ...
  lCurly: '{',
  rCurly: '}',
  invalid: /.+/
})

rCurly gets moved to the list of keywords matched by invalid. Then the whole line } // comment gets parsed as invalid, which doesn't match rCurly. I could use /[^{}]+/ instead, but it gets very messy with negative lookaheads for tokens like #define. It's easy to forget to add a new token to the invalid regexp, and hard to debug the consequences. I'd obviously prefer a general solution upstream.

I'm willing to try and implement this in a PR if you guys think it's a good idea.

deltaidea commented 7 years ago

You made me think of a much simpler way to implement it:

let errorRe = /(?:(?!<re>).)+/my // <re> is `lexer.re`, i.e. all the valid stuff.

When the lexer can't parse a valid token, errorRe.exec(input) matches everything right up to the next valid one.
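
A minimal sketch of how such a regexp might be built, assuming the valid-token patterns are available as one alternation source. The three patterns, validSource and matchInvalidRun are hypothetical names for illustration; nothing here relies on moo's internals.

```javascript
// Hypothetical valid-token patterns; <re> above stands for the lexer's combined regex.
const validSource = ['[a-z]+', '\\s+', '[{}]'].join('|');

// Consume one character at a time for as long as no valid token starts there.
// 'y' anchors the match at lastIndex; '.' stops at a line break.
const errorRe = new RegExp('(?:(?!' + validSource + ').)+', 'my');

// Returns the run of invalid text starting at `pos`, or null if a valid token starts there.
function matchInvalidRun(input, pos) {
  errorRe.lastIndex = pos;
  const m = errorRe.exec(input);
  return m ? m[0] : null;
}
```

With these patterns, matchInvalidRun('$$% abc', 0) returns '$$%', and matchInvalidRun('abc', 0) returns null because a valid token starts at position 0. The negative lookahead is re-evaluated at every character, so this is likely slower than ordinary token matching on long invalid runs.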

nathan commented 7 years ago

I still don't think that error recovery at the level of lexical analysis is what you want. Could you provide some examples from your language?

tjvr commented 7 years ago

the whole line } // comment gets parsed as invalid which doesn't match rCurly.

Argh. I knew making keyword handling implicit would be a bad idea. Perhaps this is another reason to make keyword handling explicit (#53).

tjvr commented 7 years ago

Here are some suggestions:

- We recommend not implementing error recovery as part of your lexer (as @nathan says).
- Since keyword handling is now explicit (#57), your excerpt from above should now behave as expected.
- If you really want to do this, I think the right place is another library on top of moo; not in the moo core itself. Sorry! :-)
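
As a rough illustration of what a layer on top of the lexer could look like, the sketch below post-processes an already lexed token array and turns every line that contains an invalid token into a single comment token. It is an assumption-laden example, not part of moo: commentOutInvalidLines and the 'invalid' and 'comment' type names are hypothetical, and the tokens only need type, value and line fields.

```javascript
// Hypothetical post-processing layer over tokens shaped like { type, value, line }.
// Any line containing an `invalidType` token is collapsed into one `commentType` token.
function commentOutInvalidLines(tokens, invalidType = 'invalid', commentType = 'comment') {
  // Collect the line numbers on which an invalid token starts.
  const badLines = new Set(
    tokens.filter(t => t.type === invalidType).map(t => t.line)
  );
  const out = [];
  for (const tok of tokens) {
    if (!badLines.has(tok.line)) {
      out.push(tok); // untouched line: pass the token through
      continue;
    }
    const last = out[out.length - 1];
    if (last && last.type === commentType && last.line === tok.line) {
      last.value += tok.value; // extend the comment token created for this line
    } else {
      out.push({ type: commentType, value: tok.value, line: tok.line });
    }
  }
  return out;
}
```

A token is assigned to the line it starts on, so a line break token that begins on an invalid line ends up inside that line's comment. Lines that are lexically valid but syntactically wrong are not caught here; as noted earlier in the thread, those need the parser.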