ollef / Earley

Parsing all context-free grammars using Earley's algorithm in Haskell.
BSD 3-Clause "New" or "Revised" License
363 stars 24 forks source link

Add simple benchmark #6

Closed phadej closed 9 years ago

phadej commented 9 years ago

I was interested how this compares to parsec, so added a benchmark suite:

screen shot 2015-05-07 at 09 11 52

I guess it's good to have, so one can catch performance regressions too; though I don't know how to automate that part.

ollef commented 9 years ago

Cool! This will be immensely useful to have. I have some comments:

  1. BenchAll.hs requires FlexibleContexts using recent GHC versions as extensions are now required even when they are only implicitly used.
  2. I think the benchmark is a bit unfair because the two parsers do not support the same language. The Parsec parser notably doesn't support parentheses. The Earley parser could perhaps be rewritten to match the style of the Parsec parser:
sepBy1 :: Prod r e t a -> Prod r e t op -> Grammar r e (Prod r e t [a])
sepBy1 p op = mdo
  ops <- rule $ pure [] <|> (:) <$ op <*> p <*> ops
  return $ (:) <$> p <*> ops

expr :: Grammar r String (Prod r String Token Expr)
expr = mdo
  let var = Var <$> satisfy isIdent
  mul <- fmap (foldl1 Mul) <$> (add `sepBy1` symbol "*")
  add <- fmap (foldl1 Add) <$> (var `sepBy1` symbol "+")
  return mul

Though this is arguably what we are trying to avoid by using Earley. Maybe we should have both versions? I'm not sure.

  1. The benchmarks have more overhead than necessary that stems from the creation and lexing of the input string. Maybe something like this (for both parsers) would save us some of that work:
earleyBench n = parseEarley $ "x" : concat (replicate n ["+", "x"])

Overall the results match what I have seen before (just from running time on some parsers). It would be interesting to see if there's any low-hanging fruit to improve the timings. Though we also have to remember that a library like Earley that gives all possible parses does more work than libraries like Parsec because it has to try every possible next symbol at every position in the input.

phadej commented 9 years ago

I changed the benchmarks a bit, so the token list is generated outside. For some reason parsec is now much faster?

I also added tree branching expression benchmarks.

screen shot 2015-05-07 at 11 41 03

ollef commented 9 years ago

What's the difference between earleyTree and parsecTree? I suspect something went wrong with this commit or you forgot something. :)

I think the better performance is expected if most of the time that Parsec used before was to construct the input (though Earley should also get the same absolute time improvement).

phadej commented 9 years ago

TIL: http://stackoverflow.com/questions/8379191/forcing-evaluation-of-function-input-before-benchmarking-in-criterion

will push updated version of benchmark soon

phadej commented 9 years ago

TIL: http://stackoverflow.com/questions/8379191/forcing-evaluation-of-function-input-before-benchmarking-in-criterion

will push updated version of benchmark soon

phadej commented 9 years ago

The latest version: screen shot 2015-05-07 at 14 15 51

Both seem to be O(n) where n is number of tokens; yet the constant is a bit different :)

ollef commented 9 years ago

Yeah, there's still some work to be done on shaving some of that constant. You can regain a bit of performance by rewriting the grammar as I did above, though ideally you shouldn't have to do that. I'm thinking there might be a way to have a special cased 'fast path' version of the code for when the parse is unambiguous, but I'll have to investigate.

Improving performance should be much easier now that we have proper benchmarks and a test suite. :)

phadej commented 9 years ago

:+1: