Proposal: Add option to auto skip whitespace when using lexer

yangfl commented 3 years ago

Maintainers: @kach @tjvr

In 99% of cases you don't care about whitespace tokens; they only serve to separate other tokens. When you use a lexer, it is safe to skip them, since the lexer has already split the input into tokens correctly. For example, to parse arithmetic expressions, one could write

expr -> expr %op expr | "(" expr ")" | %number

instead of

expr -> _ expr _ | expr _ %op _ expr | "(" _ expr _ ")" | %number

which is much messier and harder to maintain.

I'd suggest adding some code similar to

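// wsType: proposed option listing the token types to treat as whitespace
if (nextColumn.states.length === 0 && this.wsType.includes(token.type)) {
    this.table.pop(); // discard the empty column created for this token
    continue;
}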

around

https://github.com/kach/nearley/blob/98e4d21ef9c7836700c0503c10bb0d6465a3c26a/lib/nearley.js#L324-L338

I would be interested in writing this feature myself.
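Until something like this exists in the parser, a similar effect can probably be had on the lexer side by dropping whitespace tokens before the parser ever sees them. The following is only a minimal sketch: `skipTokens` and `toyLexer` are hypothetical names invented for illustration, and the toy lexer is a stand-in for a real tokenizer such as moo. It assumes the wrapped lexer follows nearley's custom-lexer interface (`reset`, `next`, `save`, `formatError`, `has`) and that `next()` returns `undefined` at the end of input.

```javascript
// Hypothetical helper: wraps a lexer and drops every token whose type
// is listed in wsTypes, so the parser never receives them.
function skipTokens(lexer, wsTypes) {
  const skip = new Set(wsTypes);
  return {
    reset(chunk, state) { lexer.reset(chunk, state); return this; },
    next() {
      let token = lexer.next();
      // Keep pulling tokens until one is not whitespace or input ends.
      while (token && skip.has(token.type)) token = lexer.next();
      return token;
    },
    save() { return lexer.save(); },
    formatError(token, message) { return lexer.formatError(token, message); },
    has(name) { return lexer.has(name); },
  };
}

// Toy stand-in for a real tokenizer, just enough to demonstrate the wrapper.
function toyLexer() {
  // Sticky (y) regexes match only at lastIndex, i.e. at the current position.
  const rules = [
    ["ws", /[ \t]+/y],
    ["number", /[0-9]+/y],
    ["op", /[-+*\/]/y],
    ["lparen", /\(/y],
    ["rparen", /\)/y],
  ];
  let input = "";
  let pos = 0;
  return {
    reset(chunk) { input = chunk; pos = 0; return this; },
    next() {
      if (pos >= input.length) return undefined;
      for (const [type, re] of rules) {
        re.lastIndex = pos;
        const m = re.exec(input);
        if (m) {
          pos = re.lastIndex;
          return { type, value: m[0], text: m[0] };
        }
      }
      throw new Error("Unexpected character at offset " + pos);
    },
    save() { return {}; },
    formatError(token, message) { return message; },
    has(name) { return rules.some(([type]) => type === name); },
  };
}

const filtered = skipTokens(toyLexer(), ["ws"]);
filtered.reset("1 + (2 * 3)");
const types = [];
for (let t = filtered.next(); t; t = filtered.next()) types.push(t.type);
console.log(types.join(" ")); // number op lparen number op number rparen
```

With whitespace filtered out this way, the shorter form of the grammar (without `_`) should be usable. The trade-off is that whitespace is then dropped everywhere, including in places where a grammar might want it to be significant.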

hikerpig commented 2 years ago

How is this going? I'd love to see this implemented. For reference, ohm-js explains how it automatically skips whitespace: https://ohmjs.org/docs/syntax-reference#syntactic-lexical. lark-parser has a similar feature: https://lark-parser.readthedocs.io/en/latest/grammar.html#ignore

rbadi76 commented 2 years ago

Agreed. I want to test parsing the text "eat food on plates" with the following toy NLP grammar:

S -> V NP | VP PP
NP -> N PP
VP -> V N
PP -> P N
V -> "eat"
N -> "food" | "plates"
P -> "on"

But when I run nearley-test GrammarTest1.js -i "eat food on plates", nearley chokes on the whitespace. I would have to adjust my grammar to include whitespace tokens, but that defeats the whole point, since I want to test the grammar above, not some new grammar with whitespace tokens. I want to see the Earley items created from the grammar above and the given input string.
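For a word-level grammar like this, one possible workaround is a lexer that emits one token per word and never produces whitespace tokens at all. The sketch below is an assumption-laden illustration, not part of nearley: `wordLexer` is a hypothetical name, and it relies on my understanding that, when a custom lexer is declared with `@lexer`, nearley compares string literals such as "eat" against each token's text. The grammar file would still need that `@lexer` declaration, so it is not entirely unchanged.

```javascript
// Hypothetical word-level lexer shaped like nearley's custom-lexer
// interface. It yields one token per word and no whitespace tokens.
function wordLexer() {
  let words = [];
  let index = 0;
  return {
    reset(chunk) {
      // Split on runs of whitespace; drop empty strings left by
      // leading or trailing spaces.
      // Caveat: a word split across two fed chunks is not rejoined.
      words = chunk.split(/\s+/).filter((w) => w.length > 0);
      index = 0;
      return this;
    },
    next() {
      if (index >= words.length) return undefined;
      const value = words[index++];
      return { type: "word", value, text: value };
    },
    save() { return {}; },
    formatError(token, message) { return message; },
    has(name) { return name === "word"; },
  };
}

const wordStream = wordLexer().reset("eat food on plates");
const seen = [];
for (let t = wordStream.next(); t; t = wordStream.next()) seen.push(t.value);
console.log(seen.join(",")); // eat,food,on,plates
```

The parser would then see four tokens, so the Earley items correspond to words rather than to individual characters.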

Proposal: Add option to auto skip whitespace when using lexer #549