ollef / Earley

Parsing all context-free grammars using Earley's algorithm in Haskell.

Wrong(?) result on some grammars with infinite results #35


acertain commented 6 years ago
grammar = mdo
  r <- rule $ (pure 1 <|> ((+1) <$> r))
  pure r

produces just [1], not [1,2,3,...]
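
For reference, a complete program exhibiting this (a minimal sketch assuming the Text.Earley API as shown in the library's README; the error and token types are pinned arbitrarily):

{-# LANGUAGE RecursiveDo #-}
import Control.Applicative ((<|>))
import Text.Earley

grammar :: Grammar r (Prod r String Char Int)
grammar = mdo
  r <- rule $ pure 1 <|> ((+1) <$> r)
  pure r

main :: IO ()
main = print $ fst $ fullParses (parser grammar) ""
-- prints [1]; the point of this issue is that [1,2,3,...] might be expected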

I don't think it's possible to have this work and

grammar = mdo
  r <- rule $ (pure 1 <|> r)
  pure r

not produce an infinite list of results

I think this is a sensible choice because it currently can't deal with infinite results, but it should be documented somewhere. Theoretically it should be possible to return an infinite lazy list.

ollef commented 6 years ago

Hey, and thanks for your report!

You're right: unless the current behaviour is documented, this should probably be considered a bug.

The reason for the current behaviour is that the epsilon derivations of a rule r (the derivations that don't consume any input) are computed assuming that r itself has no epsilon derivations. That is why the recursive use of r in your grammars doesn't contribute to the result.
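
As a toy illustration (hypothetical constructors, not the library's actual internals), the computation behaves roughly like this, with every rule reference contributing nothing:

{-# LANGUAGE GADTs #-}

-- A toy production type, shadowing the library's Prod name.
data Prod t a where
  Terminal :: t -> Prod t t                             -- matches one token
  Pure     :: a -> Prod t a                             -- pure x
  Alts     :: [Prod t a] -> Prod t a                    -- alternatives
  App      :: Prod t (a -> b) -> Prod t a -> Prod t b   -- f <*> x
  RuleRef  :: String -> Prod t a                        -- reference to a rule

-- Epsilon derivations, computed assuming rule references have none.
nulls :: Prod t a -> [a]
nulls (Terminal _) = []    -- terminals consume input
nulls (Pure x)     = [x]
nulls (Alts ps)    = concatMap nulls ps
nulls (App f x)    = nulls f <*> nulls x
nulls (RuleRef _)  = []    -- the assumption in question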

The first idea that might spring to mind is to tie the knot when computing epsilon derivations: assume that r's epsilon derivations are the result of the very computation that is producing them. The problem with this approach is that it makes left-recursive grammars loop.

So instead I'm experimenting with an idea of productivity: while traversing the production looking for epsilon derivations, keep track of whether we've produced any results yet. When encountering r again after we've already produced some results, it's safe to tie the knot.

A problem with this is that it's potentially order-dependent:

r <- rule $ pure 1 <|> r

would return [1, 1, 1, ...] because r comes after pure 1, while

r <- rule $ r <|> pure 1

would return [1] because we haven't produced anything yet when encountering r.

Maybe we could reorder the alternatives to make this one return an infinite list as well, but I'm worried that might not work in general, because you don't know what order to use if there are several productive alternatives.
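
For concreteness, here is a toy model of this productivity scheme (hypothetical, specialised to a single self-referential rule; not the library's code) that reproduces exactly this order dependence:

-- A production for a single rule r, restricted to what the examples use.
data SelfProd a
  = Eps a                         -- pure x
  | SelfRef (a -> a)              -- f <$> r, a reference back to r itself
  | Alt (SelfProd a) (SelfProd a) -- left-biased alternatives

-- Epsilon derivations with productivity tracking: a reference to r only
-- ties the knot if something has already been produced to its left.
epsilons :: SelfProd a -> [a]
epsilons p = result
  where
    result = snd (go False p)
    go produced prod = case prod of
      Eps x -> (True, [x])
      SelfRef f
        | produced  -> (True, map f result) -- safe to tie the knot
        | otherwise -> (False, [])          -- drop it to avoid looping
      Alt l r ->
        let (pl, xs) = go produced l
            (pr, ys) = go pl r
        in (pr, xs ++ ys)

-- epsilons (Alt (Eps 1) (SelfRef id)) == [1,1,1,...]
-- epsilons (Alt (SelfRef id) (Eps 1)) == [1]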

Do you have any input or ideas? Do you have any more good test cases or important use cases?

Also, what do you mean by "I don't think it's possible to have this work and ... not produce an infinite list of results"? If the first one should return [1, 2, 3, ...], shouldn't the second return [1, 1, 1, ...]?

Cheers!

acertain commented 6 years ago

> If the first one should return [1, 2, 3, ...], shouldn't the second return [1, 1, 1, ...]?

That's what I mean

It should be possible to expose multiple variants of rule:

  1. use a HashMap or StableNames etc. to dedup results
  2. return infinite results
  3. the current behavior
  4. just loop
  5. something else??

I'm working on a rewrite of the parser where all but 2 should be easy to implement. Any ideas for other variants?

My rewrite doesn't compute epsilon derivations or interpret Prods, and instead has instances directly on

newtype Parser s e i a = Parser {
  unParser ::
    ParseEnv s e i ->                                 -- current parse state
    (a -> ParseEnv s e i -> ST s (ParseEnv s e i)) -> -- success continuation
    ST s (ParseEnv s e i)                             -- updated parse state
}

including Monad and a fix :: (Parser s e i a -> Parser s e i a) -> Parser s e i a function

So far, computing epsilon derivations has been avoidable, but I think it might be unavoidable for infinite results.

ollef commented 6 years ago

Hmm, I'm not too fond of the idea of exposing several variants of rule, unless I can be convinced that there are important use cases for such variants. I think I'd prefer this library just to do the Right Thing, provided we can figure out what that should be. Of course, that doesn't stop you from doing these things in your projects. :)

I think I've come up with something better than what I outlined in my previous post. What do you think of the following?

  1. Compute epsilon derivations for r assuming that r's epsilons are [] (i.e. the current behaviour).
  2. If step 1 produces any results, compute the epsilon derivations again but tie the knot.

This scheme supports left recursion without looping provided the rule doesn't have any epsilon derivations, and additionally makes both of your grammars produce infinite results. It makes r <- rule $ r <|> pure 1 loop if you try to force the results, but you can still get a report, which I think is reasonable.
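
Sketching the new scheme with the toy SelfProd type from my previous comment (again hypothetical, not the actual implementation):

-- Two passes: first assume r's epsilons are [], then, if that produced
-- anything, recompute with the knot tied.
epsilons2 :: SelfProd a -> [a]
epsilons2 p
  | null firstPass = []        -- no epsilon derivations: left recursion
                               -- stays safe, nothing loops
  | otherwise      = knotted
  where
    firstPass = go [] p        -- step 1: r's epsilons assumed to be []
    knotted   = go knotted p   -- step 2: r's epsilons are the result itself
    go self prod = case prod of
      Eps x     -> [x]
      SelfRef f -> map f self
      Alt l r   -> go self l ++ go self r

-- epsilons2 (Alt (Eps 1) (SelfRef (+1))) == [1,2,3,...]
-- epsilons2 (Alt (Eps 1) (SelfRef id))   == [1,1,1,...]
-- epsilons2 (Alt (SelfRef id) (Eps 1))   loops when forced, as noted above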

So to me it seems promising --- I just have to convince myself that this is the right thing to do in general. I have an implementation here, and the commit before has an implementation of the scheme from my previous message.

Your rewrite sounds intriguing. I'd be interested to hear how you're handling things like left-recursion in it! :)

acertain commented 6 years ago

A WIP version of my rewrite is at https://gist.github.com/fread2281/256e47aff8903d7da98d9ea6b4cff63f. A couple of things:

acertain commented 5 years ago

Here's a problem(?) with my implementation that I'm trying to figure out how to fix. I think Earley has the same problem? If not, it'd be very helpful if you could try to explain how Earley avoids it. I've benchmarked my implementation and Earley, and in the .prof files the most commonly called functions have the same order of magnitude of calls (still within n^3, I think), so I think Earley has this problem too.

We can merge results at the same rule with different start positions and the same end position! (If a calls b at positions i and j, and both i-k and j-k return results, we only need to return once, with merged results.) However, with fully binarized/memoized grammars, ignoring alternatives, we can assume that a = b <*> c, where c must be a rule, which does the merging for us.

But this still adds too many callbacks to c (if b returns at both i-k and j-k, then c gets two instances of a in its continuations), and the callbacks in c cost per position where c succeeds starting from k.

Practical, General Parser Combinators (Meerkat) avoids this by using a HashSet of Conts in rules.

However, this might not cause worst-case complexity to go over cubic with fully binarized grammars. According to the Meerkat paper:

> In Appendix A.3 we show that the execution of memoized CPS recognizers of Figure 2 and 6 can require O(n^(m+1)) operations, where m is the length of the longest rule in the grammar. The reason for such unbounded polynomial behaviour is that the same continuation can be called multiple times at the same input position. As illustrated in Appendix A.3, this happens when the same continuation is added to different continuation lists that are associated with calls made to the same recognizer but at different input positions. If the recognizer produces the same input position starting from different input positions, duplicate calls are made. The duplicate calls further result in adding the same continuations to the same continuation lists multiple times.

I think it still affects non-worst case complexity though :(

I don't know if this is avoidable without using StableNames or similar to compare Conts.
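
For what it's worth, a sketch of that StableName approach (hypothetical and IO-flavoured for brevity; it relies on the Hashable instance for StableName from the hashable package):

import Data.IORef
import qualified Data.HashSet as HashSet
import System.Mem.StableName

type Cont a = a -> IO ()

-- Add a continuation to a rule's continuation list only if that exact
-- closure hasn't been added before, comparing closures by StableName.
addCont :: IORef (HashSet.HashSet (StableName (Cont a)), [Cont a])
        -> Cont a -> IO ()
addCont ref k = do
  name <- makeStableName k
  (seen, ks) <- readIORef ref
  if name `HashSet.member` seen
    then pure ()
    else writeIORef ref (HashSet.insert name seen, k : ks)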

ollef commented 5 years ago

If I understand you correctly, the question is if we can merge the continuations of invocations of the same rule that start at different positions and end at the same position. Is this what you're asking?

This library doesn't do that, but it doesn't seem to me to be a sound optimisation. Maybe I misunderstood what you meant, though.

acertain commented 5 years ago

There are two ways to deal with the issue in the example:

  1. Merge the continuations of invocations of the same rule that start at different positions, end at the same position, and come from the same position in their parent rule: this requires extra bookkeeping per rule, and some way to compare where in the grammar continuations return to.
  2. Merge continuations of a rule that return to the same position in the grammar and start from the same position: this still needs some way to compare where in the grammar continuations return to.

In the example, 2 would merge the continuations after they get added to c, so that when c returns results in the future it costs less, while 1 would merge the continuations as b returns (before c gets called).

The Meerkat paper does 2, but I think 1 would work too.

I think they should have the same asymptotics, except with a Monad instance, where 1 can do better.

ChristopherKing42 commented 5 years ago

Maybe you could tie the knot, but then do a breadth-first search instead of a depth-first search. That way, both r <- rule $ (pure 1 <|> ((+1) <$> r)) and r <- rule $ (((+1) <$> r) <|> pure 1) would return infinite lists. In fact, it would work sensibly with any binary tree with infinitely many leaves (although a binary tree with infinitely many nodes but only finitely many leaves would diverge after returning all of its leaves).
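
A sketch of that search (hypothetical): view the results as the leaves of a lazy binary tree and enumerate them level by level:

data Tree a = Leaf a | Branch (Tree a) (Tree a)

-- Enumerate leaves breadth-first via a queue of subtrees, so a leaf at
-- finite depth is always reached even when other branches are infinite.
-- As noted above, a tree with infinitely many nodes but finitely many
-- leaves diverges after yielding all of its leaves.
bfsLeaves :: Tree a -> [a]
bfsLeaves t = go [t]
  where
    go []                  = []
    go (Leaf x : rest)     = x : go rest
    go (Branch l r : rest) = go (rest ++ [l, r])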

ollef commented 5 years ago

Yeah, that might work actually!

ChristopherKing42 commented 4 years ago

Also, an idea on how to do that efficiently: first you create a value representing a lazy binary tree, and then you search that binary tree using iterative deepening search. If the compiler fuses the binary tree away, it would be equivalent to an imperative iterative deepening search. If it chooses to keep the binary tree in memory instead, it would be equivalent to an imperative breadth-first search (except that you are scanning the tree itself in an iterative deepening manner). An optimizing compiler (like GHC) can choose which one is better based on the cost of recalculating the values in the tree, I think.
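
A sketch of that, reusing the Tree type from the previous comment (hypothetical):

-- Leaves at exactly depth d.
atDepth :: Tree a -> Int -> [a]
atDepth (Leaf x)     0 = [x]
atDepth (Leaf _)     _ = []
atDepth (Branch _ _) 0 = []
atDepth (Branch l r) d = atDepth l (d - 1) ++ atDepth r (d - 1)

-- Iterative deepening: rescan the tree at increasing depth bounds. If the
-- tree is fused away, this behaves like an imperative iterative deepening
-- search; if the tree is shared in memory, it is built once and only the
-- scanning repeats.
idLeaves :: Tree a -> [a]
idLeaves t = concatMap (atDepth t) [0 ..]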