Closed exaexa closed 2 years ago
(on a side note, the implementation is now here: https://github.com/exaexa/PikaParser.jl )
Hi @exaexa,
Super awesome that you implemented this in Julia! Thanks for the link. (PS not a big deal, but my last name is Hutchison, not Hutchinson.)
I see the problem. Thanks for pointing this out! The solution would be to iterate through the reverse-topologically-ordered list of clauses more than once, until there are no more changes of canMatchZeroChars
from false
to true
for any grammar clause. So determineWhetherCanMatchZeroChars
should return true
if it flips this value, and this loop should check if any of the calls flip the value, and if so, the whole loop should be run again:
Something like the following, and then the implementations of determineWhetherCanMatchZeroChars
each need to return the correct value:
for (boolean changed = true; changed; ) {
changed = false;
for (Clause clause : allClauses) {
if (clause.determineWhetherCanMatchZeroChars()) {
changed = true;
}
}
}
Does this look like the correct fix to you?
Thanks! The algorithm seems okay (even I'd say it can be made much simpler -- you are basically coloring stuff to true
, and keep a queue of rules that are still false and might need rechecking, which should get roughly to O(n), as in flood fill), so let's call this solved. :]
I was more worried about whether marking more rules as "matchable for free" wouldn't break some other stuff but as far as I see it it's probably okay. I was worried that someone would be able to construct a grammar that contradicts this, but I guess that's not a case.
On a side question -- this was one of the places where topo ordering was necessary (but in this case it's removed). Is there any other reason to keep the rules ordered topologically? As far as I've understood, the ordering property isn't really used anywhere else, except for making the memo table look less messy for presentation.
Yes, the dynamic programming algorithm also depends upon the topological ordering, although if I remember correctly, that is automatic or implicit, by simply starting from the terminals and then scheduling the next round of clauses upwards (all clauses that have those terminals as subclasses).
I'll actually reopen this because you raised a great point, and I want to fix this in the reference parser.
You're right that a flood fill would be the most efficient way to implement this. But I don't think you'd almost ever need two iterations through all the clauses to color everything true that needs to be colored true. So I think doing it the more inefficient but simpler way that I suggested is fine.
But I don't think you'd almost ever need two iterations
Well :D
Imagine you take the gadget from the OP of this issue and connect it to the rule B
of the same gadget, say B <- Gadget C A
. That afaik needs 3 runs (Gadget needs 2, and is needed to resolve the higher-level gadget). I guess this can be chained.
Sorry that was a typo, it should say "more than two".
In general you'd need 2 iterations per cycle, and only if cycles are connected. (Also a third iteration where nothing changes, based on how I gave the algorithm, but that could be avoided with some complexity).
For more than that number, you'd need complex interlocking cycles, and it's quite hard to write grammars that include that sort of structure. But the number of passes needed should only increase roughly linearly with the size of the grammar.
OK, I committed a fix for this. It seems to work OK but I haven't tested it extensively. Feel free to try the fix and let me know if it works for you too. Thanks again for your astute observation that led to finding this problem, @exaexa!
Sorry that was a typo, it should say "more than two".
yeah I understood it like that, didn't notice "more" is missing :D The counterexample for 3 iterations there is seriously a pathological case; I just wanted to highlight that you can't claim the iteration count is going to be always <= 2
but for reasonable grammars it will.
Anyway, thanks for all feedback!
You're right. In the worst case the performance will be O(C^2)
in the number of clauses. I'm OK with that.
Hello,
while reimplementing pikaparser for another language (julia), I noticed a possible small discrepancy when initializing
canMatchZeroChars
values for clause objects in grammar cycles. Imagine a grammar with a simple cycle and a single terminalt
such as this:Here, in topological order, C can match zero chars, B should be able to match zero chars because both A and C can match zero chars, and A (resp. Start) can trivially (resp. transitively) match zero chars.
Despite of that, the topological-ordering initialization seems to initialize B with
canMatchZeroChars=false
, because the value at A is initially false.Semantically this is probably not a very big deal, but I wonder if this could have an effect on grammar matching, such as in case there would be a rule:
...which (I guess) would at this point not be able to match with a zero-char B.
I understand that my above grammar is technically disallowed with B having
canMatchZeroChars=true
, becauseB?
is equivalent toB|ε
which would be erroneous if B can match zero chars. Is there a proof that one really can not construct a valid grammar that would trigger this possible discrepancy with initializaiton ofcanMatchZeroChars
? I didn't find any counterexample, but that ofc doesn't prove much :]Thank you very much for any help!