kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Rewrite unparse to always terminate #94

Open bakkot opened 8 years ago

bakkot commented 8 years ago

For example, s -> "a" | s s goes into an infinite loop a good fraction of the time.
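To see why, here's a minimal sketch of a naive generator (illustrative only, not nearley's actual unparse code; the grammar encoding and `maxSteps` cutoff are my own):

```javascript
// Naive random generation: repeatedly replace a nonterminal with a
// randomly chosen production. For s -> "a" | s s, half the expansions
// *grow* the pending work, so runs can balloon without bound.
const grammar = { s: [["a"], ["s", "s"]] };

function naiveUnparse(grammar, start, maxSteps = 1000) {
  const stack = [start];
  let out = "";
  let steps = 0;
  while (stack.length > 0) {
    if (++steps > maxSteps) return null; // give up: probably diverging
    const sym = stack.pop();
    if (!(sym in grammar)) { out += sym; continue; } // terminal
    const rules = grammar[sym];
    const rule = rules[Math.floor(Math.random() * rules.length)];
    for (let i = rule.length - 1; i >= 0; i--) stack.push(rule[i]);
  }
  return out;
}
```

Every successful run produces some string of `a`s, but the expected derivation size is infinite (the branching process is critical), so some runs exceed any fixed step budget.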

There are a couple of solutions to this. The one I implemented in my library for working with CFGs is based on this paper. If you actually want to go this route, though, I recommend looking at (or asking me about) this other thing I made instead, which is a simplification and generalization of that algorithm; I haven't gotten around to applying it to CFGs yet, but it should be almost trivial to do so.

The other classic solution is to bound the depth of the tree and backtrack when you've exceeded the bound without getting rid of all the nonterminals.
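That bounded/backtracking idea looks roughly like this (my own illustrative sketch, not code from either library):

```javascript
// Depth-bounded generation with backtracking: if the bound is hit while
// a nonterminal remains, that branch fails and we try another production
// instead of looping forever.
const grammar = { s: [["a"], ["s", "s"]] };

// Fisher-Yates shuffle, so productions are tried in random order.
function shuffled(arr) {
  const a = arr.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

function boundedUnparse(grammar, sym, depth) {
  if (!(sym in grammar)) return sym;     // terminal
  if (depth === 0) return null;          // bound exceeded: backtrack
  for (const rule of shuffled(grammar[sym])) {
    const parts = rule.map(s => boundedUnparse(grammar, s, depth - 1));
    if (parts.every(p => p !== null)) return parts.join("");
  }
  return null; // every production failed at this depth
}
```

This version always terminates, since the depth strictly decreases and each call tries each production at most once, but it skews the distribution of generated strings toward shallow trees.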

kach commented 8 years ago

Yes, nearley-unparse is definitely a bit of a hack—notice how it outputs to stdout char-by-char so that infinite loops still "look" like they're doing something! It also, iirc, uses a while loop instead of recursion to avoid JS stack overflows…

I would think that what we have currently is enough for nearley-unparse's main purpose, which is to make sure your grammar isn't obviously flawed in some way… but I have yet to hear of anybody using nearley-unparse in the wild. So maybe I'm completely wrong about that.

I'm not sure I really love the idea of bounding/backtracking: I'm pretty sure that can also end up in an infinite loop unless you're really careful about keeping track of what you've tried already.

I'm much more interested in implementing the algorithm in the paper you linked. If I understand it right, it maintains a table of probabilities for each (nonterminal, depth) key, where depth is bounded by N. Then it randomly picks nonterminals to traverse based on this table's probabilities. This prevents unbounded recursion, because all unboundedly deep paths have probability zero (i.e. zero trees of depth D exist where D > N). That's pretty cool—I'm looking forward to implementing this and seeing how it goes.

Thanks for the pointer! :-)

(P.S. Nice to see more CFG modules for JS! "Port to a language with a proper type system" made me laugh out loud.)

bakkot commented 8 years ago

Almost! It maintains counts, not probabilities, although probabilities would probably be way better if you could get them normalized appropriately (which might be easy; I haven't thought about it). Counts get very large, very fast, and are therefore kind of expensive. It also maintains these tables not just for each nonterminal but for each suffix of the RHS of each rule, which is also painful, since there are now two tables to maintain.

If you don't care that it generates strings with uniform probability, you can just keep booleans instead of counts: that is, your table says, for each (nonterminal, length), whether that nonterminal can produce any string at all of that length. (For the second table, it's the same question, except for (nonterminal, length, suffix of a RHS of a rule for that nonterminal).) The rest of the logic should be basically the same.
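A sketch of that boolean-table variant (my own illustrative code, assuming a non-nullable grammar; the names and encoding are made up):

```javascript
// Memoize, for each (symbol, length), whether the symbol can derive a
// string of exactly that length; then sample only among feasible rules
// and feasible splits, so generation can never paint itself into a corner.
const grammar = { s: [["a"], ["s", "s"]] };
const memo = new Map();

function canDerive(sym, n) {
  if (!(sym in grammar)) return n === 1; // terminals have length 1
  const key = sym + ":" + n;
  if (!memo.has(key)) {
    memo.set(key, false); // provisional entry guards against unit cycles
    memo.set(key, grammar[sym].some(rule => canDeriveSeq(rule, n)));
  }
  return memo.get(key);
}

function canDeriveSeq(seq, n) {
  if (seq.length === 0) return n === 0;
  if (seq.length === 1) return canDerive(seq[0], n);
  // Non-nullable: the first symbol takes k >= 1, the rest takes n - k.
  for (let k = 1; k < n; k++)
    if (canDerive(seq[0], k) && canDeriveSeq(seq.slice(1), n - k)) return true;
  return false;
}

function generate(sym, n) {
  if (!(sym in grammar)) return sym;
  const rules = grammar[sym].filter(r => canDeriveSeq(r, n));
  const rule = rules[Math.floor(Math.random() * rules.length)];
  return generateSeq(rule, n);
}

function generateSeq(seq, n) {
  if (seq.length === 0) return "";
  if (seq.length === 1) return generate(seq[0], n);
  const splits = [];
  for (let k = 1; k < n; k++)
    if (canDerive(seq[0], k) && canDeriveSeq(seq.slice(1), n - k)) splits.push(k);
  const k = splits[Math.floor(Math.random() * splits.length)];
  return generate(seq[0], k) + generateSeq(seq.slice(1), n - k);
}
```

This terminates by construction: every recursive call strictly reduces n, because non-nullability guarantees each symbol consumes at least one character.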

Mind, it only works for non-nullable grammars, so you need to either make the grammar non-nullable (not that hard, but potentially exponential blowup in size for particularly bad grammars) or consider an epsilon production to have length 1 for the purposes of generating strings (which is the solution I prefer, especially for an application like this).

Edit: also, something like this isn't possible if unparse doesn't terminate; this demo was exactly my use case.

tjvr commented 7 years ago

@bobbybee can you take a look? It's a fun algorithm that you would enjoy implementing.

kach commented 7 years ago

I cheated a bit, but 932c59a5fc2b87dfa46dce4b47997ec2b075f7a2 solves this temporarily. Limitations:

tjvr commented 7 years ago

@Hardmath123 I conclude that this is definitely more complicated than we thought. :P

A depth bound is kind of unfortunate; it's hard to guess what a sensible depth bound would be. But this is definitely better than never terminating! :-)

kach commented 7 years ago

So I actually had a working length-bounded implementation, but the problem was that it didn't work for nullables. I can try to articulate why, if you want. Here's an example:

sum -> sum "+" sum

For each way you can slice up that rule (i.e. each position of the dot), you want to compute the number of generated strings of length n. That means iterating over both where-the-dot-is and how-you-divide-n-into-x-plus-y, where the left half has length x and the right half has length y. But if you allow nullables, then x or y can potentially be zero, which means the other is equal to n. Now, when you have the call count(1, n), i.e.

sum -> sum • "+" sum

where y = 0 and hence x = n, your first recursive call to the left, count(1, n), is identical to your current call. :-(

bakkot commented 7 years ago

@Hardmath123 if all you need is a bound, I'm pretty sure you can just pretend X -> '' is of length 1 and do away with that problem (per my second comment above).

kach commented 7 years ago

@bakkot Welp. Yes. Now I feel stupid. I should have kept last night's broken implementation…

By the way, I realize you can derive a depth-bound from a length-bound pretty easily. Or a suggested depth from a given expected length. Would that be good enough, @tjvr?
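One plausible way to do that derivation (my guess at the idea, not code from the thread): a derivation tree of depth d, where every RHS has at most r symbols, yields a string of length at most r^d, so a target length L suggests a depth around ⌈log_r L⌉ plus some slack. The helper below is hypothetical:

```javascript
// Hypothetical helper: pick a depth bound from a target string length,
// using the fact that a tree of depth d with branching factor at most r
// derives a string of length at most r^d.
function depthForLength(targetLength, maxRhsSymbols, slack = 2) {
  if (targetLength <= 1) return 1 + slack;
  const r = Math.max(2, maxRhsSymbols);
  return Math.ceil(Math.log(targetLength) / Math.log(r)) + slack;
}
```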

tjvr commented 7 years ago

@hardmath123 We've already merged this, haven't we? :-) Or are we still hoping someone will do the full probability version?

kach commented 7 years ago

I'm secretly hoping that I (or someone else) will find the time to implement the length-bounded one — currently it's depth-bounded, which is marginally less intuitive for users. So let's leave it open for a little longer. :-)

tjvr commented 7 years ago

How about this? http://eli.thegreenplace.net/2010/01/28/generating-random-sentences-from-a-context-free-grammar