kach / nearley

πŸ“œπŸ”œπŸŒ² Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

A way to inspect and choose alternative parsings in a post-processor #591

Open KillyMXI opened 2 years ago

KillyMXI commented 2 years ago

I have an issue similar to #382 and #383.

I'm trying to write something like this:

...
value -> ( number | unquotedString ) {% idid %}
...

(idid unpacks two levels of nesting and is there to save me from repeating id for each alternative.)

Clearly, any number can also be interpreted as a string. If such a value is nested deeply in the overall parse result, diffing the alternatives becomes a pain. And if there are many occurrences, the number of alternatives blows up exponentially.
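To make the blowup concrete, here is a small sketch (not nearley itself, just an illustration of the combinatorics): if each of n values can be read as either a number or a string, an ambiguity-preserving parser returns the full cross-product of readings.

```javascript
// Each ambiguous value contributes two readings; an ambiguity-preserving
// parser must return every combination of them.
function allParses(values) {
  return values.reduce(
    (parses, v) =>
      parses.flatMap((p) => [
        [...p, { type: "number", value: Number(v) }],
        [...p, { type: "string", value: v }],
      ]),
    [[]]
  );
}

const parses = allParses(["1", "2", "3"]);
console.log(parses.length); // 2^3 = 8 complete parses for just three values
```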

I can't use a lexer because it will be confused as well and add tokens inside unquoted strings.

Solutions that may allow me to get what I need:

While "providing all parsings for ambiguous grammars" is a remarkable feature, it stands in a way of practical tasks more often than one could desire.

I'm also experiencing growing discomfort with the readme, where it's all rainbows and unicorns, while in practice this is more like an academic project: undersupported and with important practical limitations.

conartist6 commented 2 years ago

Now that I've done my work with ambiguous grammars I don't think this issue quite makes sense. You can't use postprocess to narrow down ambiguity, because ambiguity can't really be narrowed down. If you eliminate ambiguous alternatives at any time during the parsing, it's possible that you'll eliminate the only alternative that would ultimately end up matching!
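A toy illustration of that point (hand-rolled, not nearley): greedily committing to the number reading of a prefix like "12" prunes the only viable parse of an input such as "12em", which only succeeds as a string.

```javascript
// Greedy variant: commits to the number alternative as soon as it matches,
// then fails if there is leftover input.
function parseGreedy(input) {
  const m = input.match(/^\d+/);
  if (m) {
    return m[0].length === input.length
      ? { type: "number", value: Number(m[0]) }
      : null; // pruned too early: no way back to the string reading
  }
  return { type: "string", value: input };
}

// Ambiguity-preserving variant: keeps every alternative that spans the input.
function parseAllAlternatives(input) {
  const parses = [];
  if (/^\d+$/.test(input)) parses.push({ type: "number", value: Number(input) });
  parses.push({ type: "string", value: input }); // a bare word always parses as a string
  return parses;
}

console.log(parseGreedy("12em"));          // null — the only viable parse was eliminated
console.log(parseAllAlternatives("12em")); // [ { type: 'string', value: '12em' } ]
```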

conartist6 commented 2 years ago

Also you say "I can't use a lexer because it will be confused as well and add tokens inside unquoted strings".

That shouldn't prevent you from using a lexer. It's perfectly possible to create unquoted strings from multiple tokens. Let's say you wanted to write a parser for JSON with bare words. Your input might be:

JSON syntax is made up of the characters {, }, [, ], and ". Here is an example JSON object: {"foo", [false]}

You need to tokenize {, }, [, ], and ", and indeed those tokens appear in your string, but you can just glue all of the unconsumed tokens together by their text values. It still saves the parser a lot of work not to be working through the ambiguity character by character.
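A minimal sketch of that gluing idea (the structural character set and token shape here are assumptions, not nearley or moo API): tokenize only the structural characters and concatenate every run of other characters into one "word" token.

```javascript
// Emit structural characters as their own tokens; glue every run of
// unconsumed characters into a single "word" token by their text values.
function tokenize(input) {
  const structural = new Set(["{", "}", "[", "]", '"', ",", ":"]);
  const tokens = [];
  let word = "";
  for (const ch of input) {
    if (structural.has(ch)) {
      if (word) {
        tokens.push({ type: "word", text: word });
        word = "";
      }
      tokens.push({ type: ch, text: ch });
    } else {
      word += ch; // accumulate unconsumed characters
    }
  }
  if (word) tokens.push({ type: "word", text: word });
  return tokens;
}

console.log(tokenize("{foo:[false]}"));
// "foo" and "false" each arrive at the parser as one token, not 3 and 5 characters
```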

KillyMXI commented 2 years ago

If you eliminate ambiguous alternatives at any time during the parsing, it's possible that you'll eliminate the only alternative that would ultimately end up matching!

I think it's a matter of where we try to eliminate them and whether that matters for our grammar.

Going back to my example:

...
value -> ( number | unquotedString ) {% idid %}
...

(Let's say unquotedString stops at line breaks. The original example came from my use case, where it runs to the end of input, but for the sake of argument it makes sense here to have something to parse after it.)

But if we move the elimination one level higher, then we can apply any strategy to choose among the remaining branches that didn't fail, knowing we can continue from any of them.

Ultimately, for each grammar an effective local elimination/merging strategy might exist (take first, take longest, wrap all, etc...) and there needs to be a way to apply it. Like a post-processor that has access to all local alternatives. Leaving the decision to the very last moment creates too much of an obstacle to applying it.
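The kind of post-processor being asked for might look like this sketch. To be clear, nearley's postprocessors don't receive all local alternatives today; the `strategies` table and `disambiguate` helper are hypothetical.

```javascript
// Hypothetical local disambiguation: a post-processor that sees every
// alternative for one span and applies a configurable strategy.
const strategies = {
  first: (alts) => alts[0],
  longest: (alts) =>
    alts.reduce((a, b) => (b.text.length > a.text.length ? b : a)),
  wrapAll: (alts) => ({ type: "ambiguous", alternatives: alts }),
};

function disambiguate(alts, strategy) {
  return strategies[strategy](alts);
}

const alts = [
  { type: "number", text: "12", value: 12 },
  { type: "string", text: "12", value: "12" },
];
console.log(disambiguate(alts, "first"));   // takes the number reading
console.log(disambiguate(alts, "wrapAll")); // keeps both, wrapped locally
```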

I was actually looking for a practical example where I may need to keep all alternatives, to better illustrate my point, but I haven't found one so far. If I find one, my preference would be to wrap the alternatives into an array locally instead of producing completely parallel "universes".

Wrapping is somewhat tricky though: we can only merge "universes" where the parser stopped at the same offset. That depends on the grammar, and for certain grammars it might always be the case. For example, if our values are always followed by a line break, we have a couple of options: match the line break as part of the value subparser or of the line subparser, but either way we have a clear merge point.
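That merge condition can be sketched as follows (the partial-parse shape here is an assumption for illustration): group partial parses by the offset where they stopped; only groups sharing an offset are safe local merge points.

```javascript
// Merge partial parses ("universes") that stopped at the same input offset;
// universes at different offsets must stay separate.
function mergeByOffset(partials) {
  const byOffset = new Map();
  for (const p of partials) {
    const group = byOffset.get(p.offset) ?? [];
    group.push(p);
    byOffset.set(p.offset, group);
  }
  return [...byOffset.values()].map((group) =>
    group.length === 1 ? group[0] : { offset: group[0].offset, merged: group }
  );
}

const partials = [
  { offset: 2, result: { type: "number", value: 12 } },
  { offset: 2, result: { type: "string", value: "12" } },
  { offset: 5, result: { type: "string", value: "12abc" } },
];
console.log(mergeByOffset(partials).length); // 2 universes instead of 3
```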

you can just glue all of the unconsumed tokens together by their text values

Sounds like a code smell to me. It might work out fine but it will be less clean than it should be at the very least.

And it looks like there was a hole in the logic of why I put the sentence about lexers up there: having a lexer won't save me from the ambiguity on inputs like 12.