kach / nearley

πŸ“œπŸ”œπŸŒ² Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

A way to inspect and choose alternative parsings in a post-processor #591

Open KillyMXI opened 2 years ago

KillyMXI commented 2 years ago

I have an issue similar to #382 and #383.

I'm trying to write something like this:

...
value -> ( number | unquotedString ) {% idid %}
...

(idid unpacks two levels of nesting and is there to save me from repeating id for each alternative.)

Clearly, any number can also be interpreted as a string. If such a value is nested deeply in the overall parse result, diffing the alternatives becomes a pain. And if there are many occurrences, the number of alternatives blows up exponentially.
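To make the blowup concrete, here is a small sketch (not nearley itself, just an illustration of the combinatorics): if each of n values can be read as either a number or a string, an ambiguity-preserving parser returns the full cross-product of readings.

```javascript
// Each ambiguous value contributes two readings; an ambiguity-preserving
// parser must return every combination of them.
function allParses(values) {
  return values.reduce(
    (parses, v) =>
      parses.flatMap((p) => [
        [...p, { type: "number", value: Number(v) }],
        [...p, { type: "string", value: v }],
      ]),
    [[]]
  );
}

const parses = allParses(["1", "2", "3"]);
console.log(parses.length); // 2^3 = 8 complete parses for just three values
```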

I can't use a lexer because it will be confused as well and add tokens inside unquoted strings.

Solutions that may allow me to get what I need:

While "providing all parsings for ambiguous grammars" is a remarkable feature, it stands in a way of practical tasks more often than one could desire.

I'm also experiencing growing discomfort with the readme, where it's all rainbows and unicorns, while in practice this is more like an academic project: undersupported and with important practical limitations.

conartist6 commented 2 years ago

Now that I've done my work with ambiguous grammars I don't think this issue quite makes sense. You can't use postprocess to narrow down ambiguity, because ambiguity can't really be narrowed down. If you eliminate ambiguous alternatives at any time during the parsing, it's possible that you'll eliminate the only alternative that would ultimately end up matching!
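A toy illustration of that point (hand-rolled, not nearley): greedily committing to the number reading of a prefix like "12" prunes the only viable parse of an input such as "12em", which only succeeds as a string.

```javascript
// Greedy variant: commits to the number alternative as soon as it matches,
// then fails if there is leftover input.
function parseGreedy(input) {
  const m = input.match(/^\d+/);
  if (m) {
    return m[0].length === input.length
      ? { type: "number", value: Number(m[0]) }
      : null; // pruned too early: no way back to the string reading
  }
  return { type: "string", value: input };
}

// Ambiguity-preserving variant: keeps every alternative that spans the input.
function parseAllAlternatives(input) {
  const parses = [];
  if (/^\d+$/.test(input)) parses.push({ type: "number", value: Number(input) });
  parses.push({ type: "string", value: input }); // a bare word always parses as a string
  return parses;
}

console.log(parseGreedy("12em"));          // null — the only viable parse was eliminated
console.log(parseAllAlternatives("12em")); // [ { type: 'string', value: '12em' } ]
```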

conartist6 commented 2 years ago

Also you say "I can't use a lexer because it will be confused as well and add tokens inside unquoted strings".

That shouldn't prevent you from using a lexer. It's perfectly possible to create unquoted strings from multiple tokens. Let's say you wanted to write a parser for JSON with bare words. Your input might be:

JSON syntax is made up of the characters {, }, [, ], and ". Here is an example JSON object: {"foo", [false]}

You need to tokenize {, }, [, ], and ", and indeed those tokens appear in your string, but you can just glue all of the unconsumed tokens together by their text values. It still saves the parser a lot of work not to be working through the ambiguity character by character.
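A minimal sketch of that gluing idea (the structural character set and token shape here are assumptions, not nearley or moo API): tokenize only the structural characters and concatenate every run of other characters into one "word" token.

```javascript
// Emit structural characters as their own tokens; glue every run of
// unconsumed characters into a single "word" token by their text values.
function tokenize(input) {
  const structural = new Set(["{", "}", "[", "]", '"', ",", ":"]);
  const tokens = [];
  let word = "";
  for (const ch of input) {
    if (structural.has(ch)) {
      if (word) {
        tokens.push({ type: "word", text: word });
        word = "";
      }
      tokens.push({ type: ch, text: ch });
    } else {
      word += ch; // accumulate unconsumed characters
    }
  }
  if (word) tokens.push({ type: "word", text: word });
  return tokens;
}

console.log(tokenize("{foo:[false]}"));
// "foo" and "false" each arrive at the parser as one token, not 3 and 5 characters
```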

KillyMXI commented 2 years ago

If you eliminate ambiguous alternatives at any time during the parsing, it's possible that you'll eliminate the only alternative that would ultimately end up matching!

I think it's a matter of where we try to eliminate them and whether that matters for our grammar.

Going back to my example:

...
value -> ( number | unquotedString ) {% idid %}
...

(Let's say unquotedString stops at line breaks. The original example came from my use case, where it runs to the end of input, but for the sake of argument it makes sense here to have something to parse after it.)

But if we move the elimination one level higher, then we can apply any strategy to choose among the remaining branches that didn't fail, knowing we can continue from any of them.

Ultimately, for each grammar an effective local elimination/merging strategy might exist (take first, take longest, wrap all, etc...) and there needs to be a way to apply it. Like a post-processor that has access to all local alternatives. Leaving the decision to the very last moment creates too much of an obstacle to applying it.
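The kind of post-processor being asked for might look like this sketch. To be clear, nearley's postprocessors don't receive all local alternatives today; the `strategies` table and `disambiguate` helper are hypothetical.

```javascript
// Hypothetical local disambiguation: a post-processor that sees every
// alternative for one span and applies a configurable strategy.
const strategies = {
  first: (alts) => alts[0],
  longest: (alts) =>
    alts.reduce((a, b) => (b.text.length > a.text.length ? b : a)),
  wrapAll: (alts) => ({ type: "ambiguous", alternatives: alts }),
};

function disambiguate(alts, strategy) {
  return strategies[strategy](alts);
}

const alts = [
  { type: "number", text: "12", value: 12 },
  { type: "string", text: "12", value: "12" },
];
console.log(disambiguate(alts, "first"));   // takes the number reading
console.log(disambiguate(alts, "wrapAll")); // keeps both, wrapped locally
```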

I was actually looking for a practical example where I may need to keep all alternatives, to better illustrate my point, but I haven't found one so far. If I find one, my preference would be to wrap the alternatives into an array locally instead of producing completely parallel "universes".

Wrapping is somewhat tricky though: we can only merge "universes" where the parser stopped at the same offset. That depends on the grammar, and for certain grammars it might always be the case. For example, if our values are always followed by a line break, we have a couple of options: match the line break as part of the value subparser or of the line subparser, but either way we have a clear merge point.
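That merge condition can be sketched as follows (the partial-parse shape here is an assumption for illustration): group partial parses by the offset where they stopped; only groups sharing an offset are safe local merge points.

```javascript
// Merge partial parses ("universes") that stopped at the same input offset;
// universes at different offsets must stay separate.
function mergeByOffset(partials) {
  const byOffset = new Map();
  for (const p of partials) {
    const group = byOffset.get(p.offset) ?? [];
    group.push(p);
    byOffset.set(p.offset, group);
  }
  return [...byOffset.values()].map((group) =>
    group.length === 1 ? group[0] : { offset: group[0].offset, merged: group }
  );
}

const partials = [
  { offset: 2, result: { type: "number", value: 12 } },
  { offset: 2, result: { type: "string", value: "12" } },
  { offset: 5, result: { type: "string", value: "12abc" } },
];
console.log(mergeByOffset(partials).length); // 2 universes instead of 3
```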

you can just glue all of the unconsumed tokens together by their text values

Sounds like a code smell to me. It might work out fine but it will be less clean than it should be at the very least.

And it looks like there was a hole in the logic of why I put the sentence about lexers up there: having a lexer won't save me from the ambiguity on inputs like 12.