kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Implement Aycock-Horspool precomputation of nullable token sets #11

Open · aredridel opened this issue 10 years ago

aredridel commented 10 years ago

from Aycock & Horspool "Practical Earley Parsing", The Computer Journal, Vol. 45, No. 6, 2002
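For context, the nullable set is the set of nonterminals that can derive the empty string, computed up front by fixed-point iteration. A minimal sketch, assuming rules shaped like `{ name, symbols }` with nonterminal names as strings (illustrative, not nearley's actual internals):

```js
// Compute the set of nullable nonterminals by fixed-point iteration:
// a nonterminal is nullable if some rule for it has a right-hand side
// consisting entirely of already-known-nullable nonterminals.
function computeNullable(rules) {
  const nullable = new Set();
  let changed = true;
  while (changed) {
    changed = false;
    for (const rule of rules) {
      if (nullable.has(rule.name)) continue;
      const allNullable = rule.symbols.every(
        s => typeof s === "string" && nullable.has(s) // terminals never qualify
      );
      if (allNullable) { // also true for empty right-hand sides
        nullable.add(rule.name);
        changed = true;
      }
    }
  }
  return nullable;
}
```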

kach commented 10 years ago

Again, I'd be happy to introduce the optimization if I understood it. :-)

aredridel commented 10 years ago

There are two parts here: build a DFA to handle the state transitions, basically precomputing the Earley sets for each state.

Then, because nullable rules screw that up, iteratively merge the other possible states into each state in the DFA, based on the nullability of the next rule, effectively making the nulls disappear. (Caveat in the postprocessing function)

Then run the predictor part of the algorithm via DFA rather than the direct Earley approach. Super fast.
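In miniature, the nullable-merging step amounts to pushing the dot past nullable symbols ahead of time, so null derivations never spawn extra Earley items. A hedged sketch (illustrative only, not the full DFA construction from the paper):

```js
// Given a dotted rule and the precomputed nullable set, advance the dot
// past every nullable symbol; states that differ only in skipped
// nullables collapse into one.
function advancePastNullables(rule, dot, nullable) {
  while (dot < rule.symbols.length && nullable.has(rule.symbols[dot])) {
    dot++;
  }
  return dot;
}
```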

aredridel commented 10 years ago

The Marpa paper describes both Leo's and Aycock & Horspool's methods much better than the original papers.

http://dinhe.net/~aredridel/.notmine/PDFs/Marpa,%20a%20practical%20general%20parser:%20the%20recognizer.pdf

jeffreykegler commented 10 years ago

I'm happy to see an alternate implementation of Marpa. I thought I'd share some insights ahead of time from a revised paper on Marpa.

First, after considerable experience and looking at the numbers, I decided that the LR(0) state-driven approach of the original Marpa implementation was not worth the extra complication -- basically, it was counter-productive. Marpa continues to use Aycock & Horspool's idea of rewriting the grammar to deal with nullable symbols and rules, and in fact Marpa uses the rewrite to deal with other issues as well.
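For readers following along, the rewrite in its simplest form is classic epsilon-elimination: each rule is replaced by variants that omit some subset of its nullable symbols, making the grammar epsilon-free. A sketch under that simplification (Aycock & Horspool and Marpa do something more refined), reusing the `{ name, symbols }` shape from above:

```js
// Replace each rule with one variant per subset of omitted nullable
// symbols; empty right-hand sides are dropped entirely. Note the caveat
// mentioned earlier: postprocessing must re-insert null values at the
// omitted positions.
function rewriteNullables(rules, nullable) {
  const out = [];
  for (const rule of rules) {
    let variants = [[]];
    for (const sym of rule.symbols) {
      const optional = typeof sym === "string" && nullable.has(sym);
      variants = variants.flatMap(v =>
        optional ? [v.concat(sym), v] : [v.concat(sym)]
      );
    }
    for (const symbols of variants) {
      if (symbols.length > 0) out.push({ name: rule.name, symbols });
    }
  }
  return out;
}
```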

There'll be more in my new Marpa paper. And of course my code, which contains the re-implementation, is public and open source.

aredridel commented 10 years ago

Oh interesting! I'd have expected the reduction in running time by N[symbols] to make a huge difference, especially for large rule-sets. I'll definitely keep an eye out for the new paper!

jeffreykegler commented 10 years ago

So did I. But at this point I suspect I have more experience with this approach than anyone, including Aycock and Horspool themselves. The numbers I ran for Perl, C and HTML are on my mailing list. Bottom line -- the gain for practical languages is almost nothing except for prediction states. And prediction states can be optimized in other ways.

The current implementation of Marpa has reverted to the classic Earley items of Jay Earley's paper. It still uses Joop Leo's optimization, as well as my rearrangement of the parse engine to allow procedural parsing.

If you are following Marpa as closely as it seems you might, perhaps you'd like to join the marpa-parser mailing list, the IRC channel, or both. (The IRC channel is back-logged, which is convenient for lurking.)

aredridel commented 10 years ago

Indeed, I will do just that -- I've been doing quite a lot of reading. Your earlier paper is probably the clearest I've read on the subject by far.

What other ways have you found to optimize prediction states?

kach commented 10 years ago

I'm still not sure I entirely understand either of the optimizations (high school freshman!). If I understand correctly, you're trying to optimize the part where we add more rules to the row of the table based on what nonterminals could come after the dot…?
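For reference, the step in question is Earley's predictor: when the dot sits before a nonterminal, every rule for that nonterminal gets added to the current set. A minimal sketch (deduplication omitted, not nearley's actual code):

```js
// Naive predictor: scan the whole grammar for rules whose left-hand side
// matches the symbol after the dot, and seed them into the current set.
function predict(item, rules, currentSet, position) {
  const next = item.rule.symbols[item.dot]; // symbol after the dot
  for (const rule of rules) {
    if (rule.name === next) {
      currentSet.push({ rule, dot: 0, origin: position });
    }
  }
}
```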

jeffreykegler commented 10 years ago

@Hardmath123 -- I assume you've read some of the descriptions of Earley's algorithm; you want to be sure you know it inside and out first. Then look at the algorithm-oriented writeups on my blog -- there's an annotated list.

It is hard -- there are layers on layers and it took me years. (I'm 60 years old, I've had time to do this.)

From there, you might want to wait for my paper. I hope it won't take long.

jeffreykegler commented 10 years ago

@aredridel -- I've seen your collection of parsing papers, which is excellent, and I feel honored to see mine in such a select group.

Re the prediction state optimizations. There's nothing surprising. Note that predictions in any Earley set always have the same origin and current location, so the only thing that varies is the rule. So in effect, you're manipulating a bit vector, with one bit per rule. Since predictions can be dealt with in a bit vector, there's not a lot of point to reducing the number of Earley items they require -- you can do predictions with zero Earley items.
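Concretely, since origin and location are fixed for all predictions in a set, one integer mask per Earley set suffices. A sketch assuming rules are numbered and `predictedBy` is a hypothetical precomputed map from a nonterminal to the BigInt mask of every rule it (transitively) predicts:

```js
// One bit per rule: OR in the precomputed prediction mask for a symbol.
function addPredictions(mask, symbol, predictedBy) {
  return mask | (predictedBy.get(symbol) ?? 0n);
}

// Example: "Expression" predicts rules 0 and 2.
const predictedBy = new Map([["Expression", 0b101n]]);
let predictions = 0n;
predictions = addPredictions(predictions, "Expression", predictedBy);
// No individual Earley items were created for these predictions.
```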

aredridel commented 10 years ago

Indeed. The "prediction" part of the algorithm can, in theory, be precomputed: you get a symbol and just look up the list of states to add, rather than looping over the grammar's rules to find all the predictions that match right now. It should reduce that step from O(n) to O(1). That said, it's not the simplest precomputation to do. (I've been working on it since I opened this issue!)
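One way to do that precomputation (a sketch, with hypothetical helper names): take the transitive closure of "predicting X also predicts the leftmost nonterminal of each of X's rules", so the parse-time step becomes a single map lookup:

```js
// Precompute, for each nonterminal, every rule it predicts directly or
// transitively through leftmost nonterminals.
function buildPredictionTable(rules) {
  const bySymbol = new Map();
  for (const rule of rules) {
    if (!bySymbol.has(rule.name)) bySymbol.set(rule.name, []);
    bySymbol.get(rule.name).push(rule);
  }
  const table = new Map();
  for (const name of bySymbol.keys()) {
    const seen = new Set([name]);
    const stack = [name];
    const predicted = [];
    while (stack.length > 0) {
      for (const rule of bySymbol.get(stack.pop()) ?? []) {
        predicted.push(rule);
        const first = rule.symbols[0];
        if (typeof first === "string" && !seen.has(first)) {
          seen.add(first); // leftmost nonterminal predicts further rules
          stack.push(first);
        }
      }
    }
    table.set(name, predicted);
  }
  return table; // at parse time, table.get(symbol) is a single lookup
}
```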

jeffreykegler commented 10 years ago

I'm happy to continue this here if you like, but maybe the IRC channel would be more interactive: #marpa on irc.freenode.net

tjvr commented 9 years ago

> Marpa continues to use Aycock & Horspool's idea of rewriting the grammar to deal with nullable symbols and rules, and in fact Marpa uses the rewrite to deal with other issues as well.

Hi @jeffreykegler. What issues are you talking about? What advantage does converting to NNF get you? Thanks!

jeffreykegler commented 9 years ago

@blob8108 The set of issues I do rewrites for is evolving. At this point the main use, other than nullables, is for statements that extend BNF -- sequence rules and precedence rules. These are implemented via a rewrite into BNF.
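For example, a sequence rule can desugar into a left-recursive pair of plain BNF rules. A sketch in the `{ name, symbols }` shape used above (illustrative, not Marpa's actual encoding):

```js
// A ::= B+  rewritten into ordinary BNF:
const rewritten = [
  { name: "A", symbols: ["A", "B"] }, // A -> A B   (left recursion)
  { name: "A", symbols: ["B"] },      // A -> B     (base case)
];
```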