kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

State of discarding tokenizers is sometimes not saved #628

Open rantvm opened 1 year ago

rantvm commented 1 year ago

I have observed that the parser sometimes ignores the state of tokenizers that silently discard some tokens. In particular, the state is ignored when the first input chunk(s) consist only of discarded tokens. This causes the position information of subsequent tokens to become desynchronized from the input. Below is an example of such a tokenizer's next() method; readToken() is a stand-in for whatever pulls the next raw token from the buffer.

const discard = { whitespace: true, comment: true };

function next() {
    let token;
    do {
        // readToken() is a hypothetical stand-in for the code that
        // pulls the next raw token from the buffer.
        token = readToken();
    } while (token && discard[token.type]);
    return token;
}

The cause appears to be the if-statement below, in combination with the defined behaviour of lexer.reset(chunk, info). https://github.com/kach/nearley/blob/6e24450f2b19b5b71557adf72ccd580f4e452b09/lib/nearley.js#L356-L358
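
For context, the guarded save at those lines has roughly the following shape (a paraphrase, not a verbatim quote of the linked commit):

// At the end of Parser.prototype.feed (paraphrased): token holds the
// last value returned by lexer.next(), so it is undefined when the
// chunk yielded no tokens that survived discarding.
if (token) {
    this.lexerState = lexer.save();
}

Because feed begins by calling lexer.reset(chunk, this.lexerState), a state that was never saved means the lexer presumably restarts its line and column tracking from scratch on the next chunk, even though it has already consumed the discarded input.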

This statement seems to assume that if there have been no tokens so far, there is no tokenizer state. Simply always executing this.lexerState = lexer.save() resolves the issue. There may be circumstances, of which I am unaware, where the current behaviour is required, so it may be prudent to add a parser option that causes the tokenizer state to always be stored.
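
A minimal sketch of that option-based variant follows; the option name saveLexerState is hypothetical, not an existing nearley option:

// In Parser.prototype.feed, replacing the guarded save (sketch only;
// the saveLexerState option name is hypothetical).
if (token || this.options.saveLexerState) {
    this.lexerState = lexer.save();
}

// Opting in at construction time:
const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar), {
    saveLexerState: true,
});

This keeps the current behaviour as the default while letting lexers that discard tokens opt into unconditional state saving.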