Closed YafahEdelman closed 7 years ago
That's not a well-written grammar, though. You should write num -> num "+" [0-9]:+ | [0-9]:+
. If there's an ambiguous grammar, the assumption is that you want all possible parsings (I can think of use-cases where this is a legitimate assumption).
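To make the difference between the two grammars concrete, here's a toy Python sketch (not nearley, and the function names are made up) that counts derivations of an input like "1+2+3" under each grammar:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parses_ambiguous(expr):
    # num -> num "+" num | [0-9]:+   ("+" may associate either way)
    if expr.isdigit():
        return 1
    parts = expr.split("+")
    total = 0
    for i in range(1, len(parts)):
        left = "+".join(parts[:i])
        right = "+".join(parts[i:])
        total += parses_ambiguous(left) * parses_ambiguous(right)
    return total

@lru_cache(maxsize=None)
def parses_left_recursive(expr):
    # num -> num "+" [0-9]:+ | [0-9]:+   (forced left association)
    if expr.isdigit():
        return 1
    head, _, tail = expr.rpartition("+")
    return parses_left_recursive(head) if tail.isdigit() else 0

# "1+2+3" has two trees under the first grammar, one under the second.
```

The ambiguous grammar lets every "+" be the top of the tree, so the count grows with each operand; the left-recursive rewrite pins the tree shape and always yields exactly one parse.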
But it would be nice to have an option here rather than have nearley just make this assumption. Do you have any ideas on how to do this?
Well, I think it would be better to write a linting service that warns you if your grammar is ambiguous. In fact, that would be really nice! Also, algorithmically nontrivial, because I can't quickly think of a way to statically determine if a grammar is ambiguous.
After a bit of searching around (because making such an algorithm did seem quite interesting) I discovered that it was extremely non-trivial, i.e., not computable. Although maybe it will be possible to make a heuristic that works in a lot of basic cases; I'll have to research some more. Edit: Maybe it's only undecidable for general grammars? I'm not sure, I'll keep researching.
Well, I think it would be better to write a linting service that warns you if your grammar is ambiguous.
You know that's undecidable in general, right? As JacobEdelman points out above.
Although it's undecidable in general, it is possible to add a mode that allows precedence and will only find one possible parsing. I'm still not sure exactly what kind of speed increase this would give you, but I've implemented a (toy) parser that can do it.
Because Earley builds a parse table with all possible states, how can you avoid computing all possible parsings?
I'm not sure what you mean by adding precedence, but that sounds like a bad idea. @hardmath123 is a fan of using BNF to express operator precedence :-)
A general outline is that you can selectively compute parts of the parse table in an attempt to compute the final parsing. If your parsing was not successful, you compute more of the parse table. I'll look at doing that, but I do agree that unless it ends up causing a dramatic speed increase it won't be worth it.
You could build the shared packed parse forest (SPPF) as you go. Elizabeth Scott noted this in her 2008 paper "SPPF-Style Parsing From Earley Recognisers".
Marpa's Jeffrey Kegler uses the term Bocage to denote this structure, as you can see here.
Hmm, @patrickhuber, that looks pretty interesting. From what I can tell, this should allow us to create some sort of lazy iterable data structure where each iteration results in one valid parsing. The obvious advantage is that for very ambiguous grammars with exponential parses, it would be more memory-efficient. On the other hand, building an SPPF takes time, and so there's a speed tradeoff. This sounds like a fun experiment.
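The "lazy iterable" idea can be sketched in a few lines of Python. This is a hypothetical encoding, not Scott's actual construction: a node is either a token string (a leaf) or a `(label, alternatives)` pair, where each alternative is a list of children; ambiguity shows up as multiple alternatives, and sharing as the same child object under several parents.

```python
from itertools import product

def iter_trees(node):
    """Lazily yield one concrete parse tree per derivation."""
    if isinstance(node, str):  # leaf / token
        yield node
        return
    label, alternatives = node
    for alt in alternatives:
        # Note: product() pre-pulls each child's trees, so this sketch
        # is lazy only across top-level derivations.
        for children in product(*(list(iter_trees(c)) for c in alt)):
            yield (label,) + children

# The two readings of "1+2+3" under num -> num "+" num | [0-9]:+
n23 = ("num", [["2", "+", "3"]])
n12 = ("num", [["1", "+", "2"]])
root = ("num", [["1", "+", n23], [n12, "+", "3"]])
trees = list(iter_trees(root))  # two trees, produced one at a time
```

A consumer that only wants the first parse (or the first k parses) never pays for the rest, which is the memory win for highly ambiguous grammars.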
(Again, my first choice is that people just learn to write unambiguous grammars—in 99.9% of cases, it's very obvious how to write your grammar unambiguously and you don't actually want to enumerate all possible parsings anyway.)
Near the bottom of the paper she creates an inline representation that runs at the same time complexity as the Earley algorithm. Jeffrey Kegler stated in the Marpa forums that building the parse tree using binary, reusable nodes follows the same complexity as recognition. On trivial samples you may notice a very small bit of overhead, but as the size of the parse grows, the overhead will be inconsequential.
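One concrete payoff of keeping the packed structure: you can count derivations in time proportional to the size of the forest, without ever expanding them. Here's a small sketch under an assumed encoding (a node is a token string, or a `(label, alternatives)` pair with each alternative a list of children):

```python
from math import prod

def tree_count(node, memo=None):
    # Sum over packed alternatives, product over children; the memo
    # keyed on node identity means each shared node is visited once.
    memo = {} if memo is None else memo
    if isinstance(node, str):  # leaf / token
        return 1
    key = id(node)
    if key not in memo:
        _, alternatives = node
        memo[key] = sum(prod(tree_count(c, memo) for c in alt)
                        for alt in alternatives)
    return memo[key]

inner = ("num", [["1", "+", "2"], ["12"]])     # ambiguous: 2 readings
root = ("expr", [[inner, "+", "3"], ["123"]])  # 2*1*1 + 1 = 3 derivations
```

Even when the number of trees is exponential, this walk stays linear in the DAG, which is the point Kegler makes about recognition-level complexity.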
I agree that the easy option is to rewrite the grammar, but part of the appeal of the Earley algorithm is its ability to handle ambiguity with ease.
Whoo, okay, I'm sold. Time to dig into some papers and find someone smart enough to teach me how to implement it (@JacobEdelman ? @blob8108 ? @rwindelz ?).
I'll look it over later today/tomorrow.
If it helps, I have an implementation here in C#: https://github.com/patrickhuber/Pliant/tree/shared-packed-parse-forest/libraries/Pliant. It doesn't currently work for Leo items. The main body of code is in PulseRecognizer.cs.
Ah, nice to know. This will be a handy reference (the collection of links in your README is also very useful!).
(We haven't implemented Leo optimization yet, so that's not an issue. I should really go read about that soon.)
Personally, I would be very happy if there were some boolean parameter I could give the parser (say, "disable_ambiguity") that would throw an exception upon detecting an ambiguous input.
Obviously we shouldn't write ambiguous grammars if we don't want them, but we also shouldn't write bugs in general, and that still happens sometimes. And some grammars can be pretty big. It'd be nice to have a graceful exit, instead of a hang / memory error.
See, the problem with this is that since nearley is streaming, you never know if the input is "ambiguous" because the ambiguity might be resolved later. Let me give you an example:
main -> cow_or_colorado | cone_or_colorado
cow_or_colorado -> "CO" "W":?
cone_or_colorado -> "CO" "NE":?
Suppose I gave this parser "C" and "O". Now, it thinks you could either have a cow_or_colorado or a cone_or_colorado—but it can't throw an error yet! This is because if you feed the parser "W" or "NE", the ambiguity will get resolved in favor of the former or latter, respectively. Nearley never knows when you're done giving it input.
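The situation can be modeled in a few lines of Python. This is a toy model, not nearley: each rule is represented by the set of complete strings it can match, which is enough to show why the error can't be thrown mid-stream.

```python
# "CO" plus an optional suffix, per the two rules above.
RULES = {
    "cow_or_colorado":  {"CO", "COW"},
    "cone_or_colorado": {"CO", "CONE"},
}

def results(prefix):
    # Rules for which the input so far is already a complete parse.
    return [name for name, strings in RULES.items() if prefix in strings]

def still_open(prefix):
    # Rules that could still match differently if more input arrives.
    return [name for name, strings in RULES.items()
            if any(s.startswith(prefix) and s != prefix for s in strings)]

# After "CO": two complete parses (ambiguous!) but both rules are still
# open, so raising an error now would be premature. After "COW": one.
```

Feeding "CO" gives two results while both rules remain extensible; only the caller knows whether the input has actually ended, so only the caller can decide that the ambiguity is final.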
This means, by the way, that you could always detect ambiguity on your end by looking at the results array and ensuring that its length is exactly 1.
Well, the reason I ask is that I wrote a python version of nearley (just the parser) for my parsing library, and I want to be able to protect my users from writing ambiguous grammars that practically never end. Right now it's very hard to discover exactly which rule is ambiguous.
You can see it here btw: https://github.com/erezsh/plyplus/blob/master/plyplus/pearley.py It works very well, but unfortunately it's still very slow (10x time compared to PLY).
@erezsh On the subject of optimisation: I wrote a Python version a while ago; it might have some useful optimisations.
@erezsh pearley looks really cool! Very exciting.
I notice most of the code seems to be a direct port of nearley's—in this case, I think I can answer your comment on line 67 (https://github.com/erezsh/plyplus/blob/master/plyplus/pearley.py#L67). Let me know if you'd like that.
Again, as I've said, you can't detect ambiguity automatically. Not possible. The solution is to inspect the output with small test cases—then it becomes really easy to see, intuitively, what is getting parsed in multiple ways. Perhaps you can figure out a clever way to automate this process—if you do, let me know and I would be happy to include it in the nearley package. :-)
My 5 cents: being able to enumerate an exponential number of parses is very important and is what makes nearley, um, the parser of choice (for NLP and for generated grammars at least). Of course it makes it slower at times... but hey, universality is almost always at odds with performance.
btw, the grammar num -> num "+" num | [0-9]:+
does not in fact generate ambiguous parses, in the playground at least. How can I get my 2^n parses there? :)
It's certainly supposed to show all parses… I think the author is using an outdated version of nearley, a version which had a bug that suppressed ambiguous output in some cases (including your case). Not sure, though, because the client JS is packed.
Closing because #163 can continue this discussion.
I know it's a long shot and not really something Earley was meant for, but are there any methods or optimizations we can implement to deal with ambiguous grammars with exponential parsings? For instance:
num -> num "+" num | [0-9]:+
This seems simple, but nearley can't deal well with large statements of this type because the number of possible parsings grows exponentially with the number of plus signs. Could we implement the option of limiting the number of parsings?
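For a sense of scale, a short plain-Python sketch (not part of nearley, function name invented): the number of parse trees for a chain of n "+" signs under this grammar is the n-th Catalan number, which is why any fixed cap on results gets hit almost immediately.

```python
from math import comb

def parse_tree_count(n_plus):
    # Trees for "1+1+...+1" with n_plus plus signs under
    # num -> num "+" num | [0-9]:+ : the n-th Catalan number.
    return comb(2 * n_plus, n_plus) // (n_plus + 1)

# Growth for 0..5 plus signs: 1, 1, 2, 5, 14, 42 -- and already
# ~17,000 trees at ten plus signs.
growth = [parse_tree_count(n) for n in range(6)]
```

So a "limit the number of parsings" option would mostly be a safety valve: for inputs of any real size, almost all parses would be discarded anyway.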