kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.6k stars 230 forks source link

Ambiguous grammars and lazy parsing (and an experiment) #399

Open pipcet opened 6 years ago

pipcet commented 6 years ago

I have a highly-ambiguous Marpa grammar, which I've translated to a highly-ambiguous nearley grammar. I'd like to quickly check whether a given string can be parsed according to this grammar, and give me an arbitrary parse result, without running out of memory because there are millions of ambiguous parsings.

I think it would be really neat if there were a way to do this, and I've implemented one way of doing it as an experiment at https://github.com/kach/nearley/compare/master...pipcet:lazy?expand=1.

The idea is that instead of working through all states in a given column, we aggressively pursue those states that have reached the maximum column, updating the wantsBy tables as we go.

Implemented with JS generators, at whatever performance cost those currently come at.

I'm not sure whether to do any more with this; I'm sure there are better solutions for ambiguous grammars already out there, but I figured this was a good opportunity to really understand how nearley does its magic.

Many drawbacks, of course:

So this is mainly a notice that this exists, and an offer to work on this code some more and submit it for inclusion if there's interest in it, maybe with an alternative API for those heretics who really prefer to work with ambiguous grammars.

Currently passes 55 tests (failing 4, which need to rewind).

In any case, thanks a lot for nearley! It's pleasant to use and to play around with.

kach commented 6 years ago

Very cool! Thanks for sharing your experiment. I'm glad you could find your way around the nearley codebase, it means we must have done something right… =)

I do find the idea of using heuristics to optimize Earley interesting.

I'm sure there are better solutions for ambiguous grammars already out there

Well, yes and no. Tim wanted to implement SPPF-style structures (#163) at one point, which I think would help with this(?). You might be interested in reading the literature on that. =)

tjvr commented 6 years ago

Thanks for sharing! I'm glad you're enjoying Nearley :-)

I haven't had time to look at your experiment yet, but I can offer some brief hints.

If all you want to do is decide whether a string is valid, without generating the actual derivations, then you may get some mileage out of removing or commenting out these two lines: https://github.com/kach/nearley/blob/5be0b87e90078ed8d9839680472e39ecbe10412d/lib/nearley.js#L69 https://github.com/kach/nearley/blob/5be0b87e90078ed8d9839680472e39ecbe10412d/lib/nearley.js#L51

That will stop Nearley from actually building the result nodes, which might help with memory usage a little bit. It won't help with complexity unfortunately (unless your postprocessors have quadratic behaviour, like the EBNF modifiers sadly do).

As Kartik mentions, I've long had aspirations to switch the core algorithm to an SPPF-style Earley parser. The main change, which is very relevant here, is that identical states in a column are merged. So if there are two derivations for foo (e.g. foo -> A and foo -> b), rather than two States, this would be one State with two derivations. This would avoid (or at least reduce?) the current exponential blowup when parsing very ambiguous grammars/inputs.

Of course, you also have to move the state building to the end to get any mileage out of this; which means running all the postprocessors after you're done parsing. (Which is why I want to remove reject.) It also means you don't need to run postprocessors whose results are eventually discarded and don't make it into the final parse.

The SPPF part is just that each derivation is a sort of linked list pointing to all the child States used in deriving that rule.

Sent with GitHawk