lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Performance issues in Earley for highly ambiguous grammars #73

Closed night199uk closed 5 years ago

night199uk commented 6 years ago

Hey,

Still having some odd issues with precedence in Earley that I just can't troubleshoot or work around. Precedence just doesn't work the way I expect:

A simple example, for parsing an INI-style file:

CR:             "\r"
LF:             "\n"
CRLF:           CR LF
_TS:            ("\t" | " ")+
_EOL:          _TS? ( LF | CRLF )
_EQ:           _TS? "=" _TS?

WORD:                  /[A-Za-z0-9_][a-zA-Z0-9_.-]*/
ASCII_STRING:          /[\t !-~]+/
CONFIG_TOKEN:          "CONFIG_TOKEN"

config_sections:     config_section
config_section:    "[Config]" _EOL config_statements*
config_statements: _TS? CONFIG_TOKEN _EQ WORD _EOL

text_line:          ASCII_STRING _EOL
empty_line:        _EOL

config_file: ( config_sections | text_line | empty_line )*

And the test input:

[Config]
    CONFIG_TOKEN = ThisBreaks

Some test here

The resulting output tree is:

config_file
  text_line [Config]
  text_line     CONFIG_TOKEN = ThisBreaks
  empty_line
  text_line This is a test

What I'm trying to achieve is for the configuration sections that the grammar understands and is specifically configured for to be parsed, and for the remaining contents of the file to simply be scanned as text lines. I've tried raising the priority on config_sections to 2, likewise on config_section and config_statements, and a bunch of other ways to rejig the rules. I just can't figure out why Earley won't parse this.

What I would expect to see (I get this with lalr/contextual):

config_file
  config_sections
    config_section
      config_statements
        CONFIG_TOKEN
        ThisBreaks
  empty_line
  text_line This is a test

Unfortunately I can't use LALR. This is a section of a much larger grammar, and that grammar gets reduce/reduce collisions with LALR.

erezsh commented 6 years ago

Hi,

This seems like a bug in one of my optimizations. Sorry!

Meanwhile, you can use:

Lark( ... , earley__predict_all=True)

To make it work. Using

config_section.2: ...

should be enough to make sure the parser does what you want.

night199uk commented 6 years ago

Okay, this does get the example working - but on a larger input file (~200-300 lines), I can't complete the parse and the performance has gone through the floor.

I see two things. First, the derivation tree grows rapidly in depth: each line of the file matched by ASCII_STRING seems to create a new layer in the derivation tree. Second, this causes a maximum-recursion-depth error in earley.py:

                    k = item_key if self.predict_all else item
                    if k in self.predicted:
                        continue

'if k in self.predicted' calls Tree.__eq__, which seems to compare some very deep trees. Some timing stats from a failed run (eventually aborted with maximum recursion exceeded), times in msec:

    5271.492 complete
    5309.626 predict_and_complete
    5488.303 advance
    5488.371 complete
    9452.931 predict_and_complete
    21361.026 predict_and_complete
    24190.309 predict_and_complete

I'm working to see if I can find a way around this, but I don't really understand the internals of the Earley parser yet, just the scanner.
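The blow-up can be illustrated with a toy sketch (pure Python, NOT lark's actual Tree class): a recursive __eq__ walks one stack frame per nesting level, so a membership test against a set of deep trees exceeds the recursion limit.

```python
import sys

class ToyTree:
    """Toy stand-in for a derivation tree with recursive equality
    (a sketch; not lark's actual Tree implementation)."""
    def __init__(self, children):
        self.children = children

    def __eq__(self, other):
        # Recursive, like a naive Tree.__eq__: one frame per nesting level.
        return isinstance(other, ToyTree) and self.children == other.children

    def __hash__(self):
        return 0  # constant hash forces set lookups to fall back on __eq__

def deep_tree(depth):
    """Each line of input adds one layer, as described above."""
    node = ToyTree([])
    for _ in range(depth):
        node = ToyTree([node])
    return node

depth = sys.getrecursionlimit() * 2
predicted = {deep_tree(depth)}
try:
    _ = deep_tree(depth) in predicted  # like `if k in self.predicted`
    hit_recursion_limit = False
except RecursionError:
    hit_recursion_limit = True
print(hit_recursion_limit)  # True
```

This matches the symptom above: the deeper the derivation tree grows, the more expensive (and eventually fatal) each `if k in self.predicted` check becomes.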

erezsh commented 6 years ago

Okay thanks, I'm going to look into it.

But generally, I recommend using %ignore instead of explicit whitespace. It might improve your performance considerably.
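For instance (a sketch only, untested against the grammar above, and assuming the standard-library terminal WS_INLINE): inline whitespace could be ignored globally, while line endings stay explicit because they are structurally significant in this grammar:

```lark
%import common.WS_INLINE
%ignore WS_INLINE
```

With this in place, the explicit _TS references could be dropped from the rules; _EOL would remain explicit, since newlines carry meaning here.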

night199uk commented 6 years ago

Did some more troubleshooting on this; I suspect it's to do with a rule ending on a repeater.

config_section:    "[Config]" _EOL config_statements*

If I add an _EOL to the end of this then the block successfully matches against the empty line in the example and everything works. Unfortunately, my input doesn't necessarily contain a blank line between blocks that I can use as a terminator:

[Config]
  CONFIG_TOKEN = ThisBreaks
[Test]
  SOME_OTHER_BLOCK = Test

So the problem seems to be that 'config_section' isn't pushed to completion (I guess) when something higher in the tree (text_line) matches the next line of the file?

erezsh commented 6 years ago

Can you attach the full grammar and the input file that causes Earley to choke? Thanks

night199uk commented 6 years ago

Unfortunately not :-( Just this test case. I'm getting close to fixing this myself, though. The reason the full grammar chokes is that the parse becomes huge: every line in the file is ambiguous (text_line | something else), so the parse tree explodes. I think it ends up approximately n^2 in the number of lines in the parse target.

The real problem, both in this example and in the full grammar, is that for some reason we're not detecting the _ambig in column.add.

Because of this, later in column.add, we don't take account of precedence (probably because _ambig is supposed to deal with that earlier), and this ends up discarding one of the two ambiguous trees:

                    k = item_key if self.predict_all else item
                    if k in self.predicted:
                        continue
                    self.predicted.add(k)
                    self.to_predict.append(item)

So in a pinch I think I can get this working by adding precedence logic to this check, but really I think I need to figure out why _ambig is not detecting the ambiguity between e.g. config_statement and text_line and marking the ambiguous trees as such.

night199uk commented 6 years ago

Okay - I got it. I'm just working on a fix now. It's a lot more complex than I could have imagined.

Apologies for the long-winded explanation that follows; it'll help me justify the fix I'm about to submit :-)

If you have two branches of a repeater that match (ambiguously), the code in column.add is supposed to figure that out and create an ambiguous tree:

                # XXX Potential bug: What happens if there's ambiguity in an empty rule?
                if item.rule.expansion and item_hash in self.completed:

However, this doesn't work! The problem is the order of operations in predict_and_complete and the fact that NEW Items are generated BEFORE the ambiguity code runs.

        def predict_and_complete(column):
            while True:
                to_predict = {x.expect for x in column.to_predict.get_news()
                              if x.ptr}  # if not part of an already predicted batch
                to_reduce = column.to_reduce.get_news()
                if not (to_predict or to_reduce):
                    break

                for nonterm in to_predict:
                    column.add( predict(nonterm, column) )
                for item in to_reduce:
                    #### Complete generates NEW Items for any repeating tokens that just finished
                    new_items = list(complete(item))
                    if item in new_items:
                        raise ParseError('Infinite recursion detected! (rule %s)' % item.rule)
                    column.add(new_items)

So if branch 1 (text_line) completes first, there will be no collision in column.add, and four new items will be generated with the current parse tree for the repeater:

[
  Item<(4412163856) __anon_plus_1 : __anon_plus_1 * empty_line> Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(ASCII_STRING, '[C]'), Token(_EOL, '\n')])]), Tree(drv, [Token(ASCII_STRING, '    C = T'), Token(_EOL, '\n')])])])
  Item<(4412163856) __anon_plus_1 : __anon_plus_1 * config_sections> Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(ASCII_STRING, '[C]'), Token(_EOL, '\n')])]), Tree(drv, [Token(ASCII_STRING, '    C = T'), Token(_EOL, '\n')])])])
  Item<(4412163856) __anon_plus_1 : __anon_plus_1 * text_line> Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(ASCII_STRING, '[C]'), Token(_EOL, '\n')])]), Tree(drv, [Token(ASCII_STRING, '    C = T'), Token(_EOL, '\n')])])])
  Item<(4412163856) config_file : __anon_plus_1 * > Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(ASCII_STRING, '[C]'), Token(_EOL, '\n')])]), Tree(drv, [Token(ASCII_STRING, '    C = T'), Token(_EOL, '\n')])])])
]

Notice there is no _ambig in here. These items are added to to_predict in column.add, and predict_and_complete continues its while loop. Next, config_statement completes, triggering completion of config_section, config_sections and finally __anon_plus_1 : __anon_plus_1 config_section, generating similar new items with an alternative view of the historical parse.

---- new_items
[
  Item<(4412163856) __anon_plus_1 : __anon_plus_1 * empty_line> Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(__ANONSTR_0, '[C]'), Token(_EOL, '\n'), Tree(drv, [Tree(drv, [Token(_TS, '    '), Token(CONFIG_TOKEN, 'C'), Token(_EQ, ' = '), Token(WORD, 'T'), Token(_EOL, '\n')])])])])])])
  Item<(4412163856) __anon_plus_1 : __anon_plus_1 * config_sections> Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(__ANONSTR_0, '[C]'), Token(_EOL, '\n'), Tree(drv, [Tree(drv, [Token(_TS, '    '), Token(CONFIG_TOKEN, 'C'), Token(_EQ, ' = '), Token(WORD, 'T'), Token(_EOL, '\n')])])])])])])
  Item<(4412163856) __anon_plus_1 : __anon_plus_1 * text_line> Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(__ANONSTR_0, '[C]'), Token(_EOL, '\n'), Tree(drv, [Tree(drv, [Token(_TS, '    '), Token(CONFIG_TOKEN, 'C'), Token(_EQ, ' = '), Token(WORD, 'T'), Token(_EOL, '\n')])])])])])])
  Item<(4412163856) config_file : __anon_plus_1 * > Tree(drv, [Tree(drv, [Tree(drv, [Tree(drv, [Token(__ANONSTR_0, '[C]'), Token(_EOL, '\n'), Tree(drv, [Tree(drv, [Token(_TS, '    '), Token(CONFIG_TOKEN, 'C'), Token(_EQ, ' = '), Token(WORD, 'T'), Token(_EOL, '\n')])])])])])])
]

These items never make it to to_predict, due to the duplicate suppression in column.add:

                    if k in self.predicted:
                        continue

However, when column.add sees these it finally picks up on the ambiguity and generates an _ambig tree for config_file. But since both sets of follow-up items have already been generated, they aren't updated to be ambiguous by column.add (that is only done in the completion pass), and so the parser carries on with the ASCII_STRING-only tree, i.e. the first one added to to_predict.
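The discard can be reduced to a toy sketch (hypothetical names, not lark's real Column.add): with first-seen-wins de-duplication, the item carrying the shallower text_line-only derivation arrives first and survives, while the later item carrying the config_section derivation is silently dropped.

```python
# Items are identified by a key; the first one added wins, as in Column.add.
predicted_keys = set()
kept_items = []

def column_add(key, derivation):
    """First-seen-wins de-duplication, mimicking `if k in self.predicted`."""
    if key in predicted_keys:
        return  # duplicate key: the richer, later derivation is discarded
    predicted_keys.add(key)
    kept_items.append((key, derivation))

# Branch 1 (text_line) completes first and registers the item...
column_add("config_file: __anon_plus_1 *", ["text_line", "text_line"])
# ...then branch 2 (config_section) completes with the same key.
column_add("config_file: __anon_plus_1 *", ["config_section"])

print(kept_items)  # only the text_line-only derivation survives
```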

night199uk commented 6 years ago

Did a lot of work on this, but I ran out of time to fix it perfectly. The challenge here is maintaining compatibility with the current tree output.

Note - the current Derivation strategy has some issues with memory consumption due to the number of list copies (self.tree.children + [tree] in Item.advance), which is why it blows up on large trees. After testing with Derivations removed, with a pure Earley parser (the optimisations in Column.add removed), and with quite a few other extraneous list() copies removed, I can easily run with earley__predict_all in under 1s, even for very large trees.
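The cost of those list copies is easy to quantify with a toy sketch (illustrative only, not lark's code): building an n-child derivation via `children + [child]` copies every existing element on each advance, for roughly n^2/2 total element copies, whereas appending or sharing a reference copies nothing.

```python
def total_copies_with_plus(n):
    """Mimics `self.tree.children + [tree]` in Item.advance:
    each advance builds a brand-new list, copying every existing child."""
    copied = 0
    children = []
    for i in range(n):
        copied += len(children)   # elements duplicated by `+`
        children = children + [i] # new list every time
    return copied

def total_copies_with_append(n):
    """Appending in place (or holding a reference) duplicates nothing."""
    children = []
    for i in range(n):
        children.append(i)
    return 0  # no per-advance copying

print(total_copies_with_plus(300))   # ~n^2/2 copies for a 300-line file
print(total_copies_with_append(300)) # 0
```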

I think the right way to fix this is to move to an SPPF forest, similar to https://joshuagrams.github.io/pep/. What I don't have time to do right now is code that and retain the current tree return strategy. This would give us performance similar to the current implementation (probably better, due to removing all the list copies), even with earley__predict_all.

erezsh commented 6 years ago

Interesting, so you're saying that if I change advance() to use something like:

self.derivations.add_next( x )

which would add a reference instead of constructing a new list, it would solve the memory and speed issues?

night199uk commented 6 years ago

I had a facepalm moment looking at this last night. I fully implemented an SPPF forest based on the code at https://joshuagrams.github.io/pep/, and found it suffered from some similar issues. In the end I finally figured out why this is broken. The first solution I came up with (the hack I provided in the PR) is actually closer to the real solution than I imagined, although I think there is a cleaner way to implement it.

If you look at the code there, and at some of the other Earley parsers out there, the wrinkle is that they do this:

add_item(tag, start, end) =
    end.idx[tag][start] || append_item(tag, start, end)

i.e. they only create a new item if one doesn't already exist in the end state (Column). In lark at the moment, we always create a new item (which also gives us a new Derivations tree):

    def advance(self):
        return self.__class__(self.rule, self.ptr+1, self.start)

and then we try and let this code in column.add sort it out:

                    if k in self.predicted:
                        continue
                    self.predicted.add(k)

However, this doesn't work, because the column.add path for completion of an Item may have updated the new item, and any updates to the Derivations in the copy are lost if the item is discarded. So we end up making different Derivations updates to multiple different Items of different branches, only one of which (the first, which has an incomplete view of the tree) is finally kept. That's why the hack of adding the latest item we receive (instead of the first) works.
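A Python sketch of that memoization (hypothetical class names; the snippet above is from the pep JavaScript code): the destination column indexes items by (tag, start), and a completed derivation is merged into the existing item rather than creating a duplicate.

```python
class Item:
    def __init__(self, tag, start):
        self.tag = tag
        self.start = start
        self.derivations = []  # all derivations accumulate on ONE item

class Column:
    def __init__(self):
        self.items = []
        self.index = {}  # (tag, start) -> Item

    def add_item(self, tag, start, derivation):
        """`end.idx[tag][start] || append_item(tag, start, end)`, in Python:
        reuse the existing item for this (tag, start) if there is one."""
        key = (tag, start)
        item = self.index.get(key)
        if item is None:
            item = Item(tag, start)
            self.index[key] = item
            self.items.append(item)
        item.derivations.append(derivation)  # merged, not lost
        return item

col = Column()
col.add_item("config_file", 0, "text_line derivation")
col.add_item("config_file", 0, "config_section derivation")
print(len(col.items), len(col.items[0].derivations))  # 1 2
```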

A visual explanation might make it easier to understand. Items generated today:

Column
config_file: drv1                         <--- only this one is kept in the prediction tree today
config_file: drv1, drv2
config_file: drv1, drv2, drv3     <---- the hack means this one is kept in to_predict

What should happen:

Column
config_file: drv1, drv2, drv3

Nearley circumvents this by storing the L/R pointers in each item and only building the data when the symbol finally completes. I have to think and run some tests, but I suspect this would suffer a similar fate in certain circumstances, as each Item can still only have one L/R pointer. We'd have to keep all items at a particular column in to_predict and later add all of their L/R pointers to a common derivation tree in predict_and_complete.

Column
config_file: l<-None, r->drv1
config_file: l<-None, r->drv2
config_file: l<-None, r->drv3

erezsh commented 6 years ago

The Nearley implementation has its own issues, both in complexity and practical run-time.

I haven't yet had time to have a thorough re-read of the algorithm and the original research paper (I have to prioritize paid work). Hopefully when I do I'll have some insight into the matter. Meanwhile, keep me updated if you gain a new understanding, or if you can figure out a fix that passes all the tests and doesn't degrade the average run-time by too much.

night199uk commented 6 years ago

Okay - new PR. I'll close the old one.

I fixed the issues properly (structurally). The significant problem was that Derivations were being added to Items during advance(), while the check for unique items was done later, meaning we discarded some important Derivations. This PR flips that logic: first we advance items and add them to the current state, and then we add Derivations to the (now unique) item, ensuring that all ambiguous Derivations from a particular state end up in the same item.

Once this was fixed, I had a lot of issues with new _ambigs that were previously lost and with how to manage them, so I modified Derivations to have a very rudimentary SPPF implementation. That part can definitely be improved. As the comments say, the ideal is to make Tokens a Derivation and then store a tree of Derivations with left/right pointers, which would reduce memory consumption significantly, e.g. Derivation->Derivation->Derivation->Ambiguous->Derivation->Token->Token->Token. I'll work on that later, though; this is great for my needs now.
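The rudimentary SPPF shape described here can be sketched as follows (illustrative classes, not the actual PR code): a derivation node whose children may include an ambiguity node that packs alternative derivations of the same span.

```python
class Derivation:
    """A node in the shared forest: a symbol plus its children."""
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

class Ambiguous(Derivation):
    """Packs alternative derivations of the same span under one node."""
    def __init__(self, alternatives):
        super().__init__("_ambig", alternatives)

# Mirrors the Derivation->Ambiguous->Derivation chain described above:
# both readings of the same input line hang off a single _ambig node.
alt1 = Derivation("text_line", ["[Config]\n"])
alt2 = Derivation("config_section", ["[Config]", "\n"])
root = Derivation("config_file", [Ambiguous([alt1, alt2])])

print(root.children[0].symbol, len(root.children[0].children))  # _ambig 2
```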

Appreciate any feedback - I've tried to stay as close to your original implementation as possible. Thanks for the great framework.

erezsh commented 6 years ago

I'm glad you're finding Lark useful. And thanks for submitting a PR! However I can't merge it, since many of the tests are failing for it. (See my comment on your pull request)

You can run the tests locally with:

~/lark$  python -m tests

Or:

python setup.py test

erezsh commented 5 years ago

0.7 published