lukehutch / pikaparser

The Pika Parser reference implementation
MIT License
141 stars 12 forks source link

[question] `canMatchZeroChars` in cycles #34

Closed exaexa closed 2 years ago

exaexa commented 2 years ago

Hello,

while reimplementing pikaparser for another language (julia), I noticed a possible small discrepancy when initializing canMatchZeroChars values for clause objects in grammar cycles. Imagine a grammar with a simple cycle and a single terminal t such as this:

Start ← A
A ← B?
B ← C A
C ← t?

Here, in topological order, C can match zero chars, B should be able to match zero chars because both A and C can match zero chars, and A (resp. Start) can trivially (resp. transitively) match zero chars.

Despite of that, the topological-ordering initialization seems to initialize B with canMatchZeroChars=false, because the value at A is initially false.

Semantically this is probably not a very big deal, but I wonder if this could have an effect on grammar matching, such as in case there would be a rule:

D ← B t

...which (I guess) would at this point not be able to match with a zero-char B.

I understand that my above grammar is technically disallowed with B having canMatchZeroChars=true, because B? is equivalent to B|ε which would be erroneous if B can match zero chars. Is there a proof that one really can not construct a valid grammar that would trigger this possible discrepancy with initializaiton of canMatchZeroChars? I didn't find any counterexample, but that ofc doesn't prove much :]

Thank you very much for any help!

exaexa commented 2 years ago

(on a side note, the implementation is now here: https://github.com/exaexa/PikaParser.jl )

lukehutch commented 2 years ago

Hi @exaexa,

Super awesome that you implemented this in Julia! Thanks for the link. (PS not a big deal, but my last name is Hutchison, not Hutchinson.)

I see the problem. Thanks for pointing this out! The solution would be to iterate through the reverse-topologically-ordered list of clauses more than once, until there are no more changes of canMatchZeroChars from false to true for any grammar clause. So determineWhetherCanMatchZeroChars should return true if it flips this value, and this loop should check if any of the calls flip the value, and if so, the whole loop should be run again:

https://github.com/lukehutch/pikaparser/blob/dd89e19755acd5324e614db0174c12fecf249fa4/src/main/java/pikaparser/grammar/Grammar.java#L151

Something like the following, and then the implementations of determineWhetherCanMatchZeroChars each need to return the correct value:

        for (boolean changed = true; changed; ) {
            changed = false;
            for (Clause clause : allClauses) {
                if (clause.determineWhetherCanMatchZeroChars()) {
                    changed = true;
                }
            }
        }

Does this look like the correct fix to you?

exaexa commented 2 years ago

Thanks! The algorithm seems okay (even I'd say it can be made much simpler -- you are basically coloring stuff to true, and keep a queue of rules that are still false and might need rechecking, which should get roughly to O(n), as in flood fill), so let's call this solved. :]

I was more worried about whether marking more rules as "matchable for free" wouldn't break some other stuff but as far as I see it it's probably okay. I was worried that someone would be able to construct a grammar that contradicts this, but I guess that's not a case.

On a side question -- this was one of the places where topo ordering was necessary (but in this case it's removed). Is there any other reason to keep the rules ordered topologically? As far as I've understood, the ordering property isn't really used anywhere else, except for making the memo table look less messy for presentation.

lukehutch commented 2 years ago

Yes, the dynamic programming algorithm also depends upon the topological ordering, although if I remember correctly, that is automatic or implicit, by simply starting from the terminals and then scheduling the next round of clauses upwards (all clauses that have those terminals as subclasses).

lukehutch commented 2 years ago

I'll actually reopen this because you raised a great point, and I want to fix this in the reference parser.

You're right that a flood fill would be the most efficient way to implement this. But I don't think you'd almost ever need two iterations through all the clauses to color everything true that needs to be colored true. So I think doing it the more inefficient but simpler way that I suggested is fine.

exaexa commented 2 years ago

But I don't think you'd almost ever need two iterations

Well :D

Imagine you take the gadget from the OP of this issue and connect it to the rule B of the same gadget, say B <- Gadget C A. That afaik needs 3 runs (Gadget needs 2, and is needed to resolve the higher-level gadget). I guess this can be chained.

lukehutch commented 2 years ago

Sorry that was a typo, it should say "more than two".

In general you'd need 2 iterations per cycle, and only if cycles are connected. (Also a third iteration where nothing changes, based on how I gave the algorithm, but that could be avoided with some complexity).

For more than that number, you'd need complex interlocking cycles, and it's quite hard to write grammars that include that sort of structure. But the number of passes needed should only increase roughly linearly with the size of the grammar.

lukehutch commented 2 years ago

OK, I committed a fix for this. It seems to work OK but I haven't tested it extensively. Feel free to try the fix and let me know if it works for you too. Thanks again for your astute observation that led to finding this problem, @exaexa!

exaexa commented 2 years ago

Sorry that was a typo, it should say "more than two".

yeah I understood it like that, didn't notice "more" is missing :D The counterexample for 3 iterations there is seriously a pathological case; I just wanted to highlight that you can't claim the iteration count is going to be always <= 2 but for reasonable grammars it will.

Anyway, thanks for all feedback!

lukehutch commented 2 years ago

You're right. In the worst case the performance will be O(C^2) in the number of clauses. I'm OK with that.