lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Lark.save does not work using the earley parser #788

Closed wysiwyng closed 1 year ago

wysiwyng commented 3 years ago

When trying to call Lark.save(f) on an Earley parser, this exception is thrown: AttributeError: 'XEarley' object has no attribute '__serialize_fields__'

To reproduce:

from lark import Lark

grammar = """
    sentence: noun verb noun        -> simple
            | noun verb "like" noun -> comparative
    noun: adj? NOUN
    verb: VERB
    adj: ADJ
    NOUN: "flies" | "bananas" | "fruit"
    VERB: "like" | "flies"
    ADJ: "fruit"
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start='sentence', ambiguity='explicit', parser='earley')
with open('saved.pickle', 'wb') as f:
    parser.save(f)

Using lalr saves the parser as expected, but is not an option in my use case. From the documentation and various issues (https://github.com/lark-parser/lark/issues/508#issuecomment-596068571, https://github.com/lark-parser/lark/issues/492#issuecomment-596068838) I assumed that parser saving is available regardless of parser type.

MegaIng commented 3 years ago

Nope, not supported.

wysiwyng commented 3 years ago

Is this a planned feature? If not, could you please update the documentation to reflect this?

MegaIng commented 3 years ago

I don't know if it is planned, @erezsh has to answer that, but I don't think so (If I remember correctly, the parser state for earley is quite large). I also can't edit the docs.

ThatXliner commented 3 years ago

If I'm not mistaken, Lark uses its own way of serializing/deserializing Lark objects, requiring the __serialize_fields__ and __serialize_namespace__ attributes, right?

https://github.com/lark-parser/lark/blob/460a221923317299422f14d87901dee1eb5f5fda/lark/utils.py#L43-L50
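For illustration, a minimal pure-Python sketch of that opt-in pattern (class names here are mine, not lark's actual implementation):

```python
class Serialize:
    # sketch of the opt-in pattern: a class declares __serialize_fields__,
    # and serialize() collects exactly those attributes into a dict
    def serialize(self):
        fields = self.__serialize_fields__  # raises AttributeError if missing
        return {f: getattr(self, f) for f in fields}

class RuleLike(Serialize):
    __serialize_fields__ = ['origin', 'expansion']
    def __init__(self, origin, expansion):
        self.origin, self.expansion = origin, expansion

class EarleyLike(Serialize):
    pass  # no __serialize_fields__, mirroring the reported AttributeError

print(RuleLike('sentence', ['noun', 'verb']).serialize())
```

A class that never declares __serialize_fields__ fails with exactly the kind of AttributeError quoted at the top of this issue.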

MegaIng commented 3 years ago

Yes

erezsh commented 3 years ago

Lark.save exists in order to save the parse table that LALR generates, because it's expensive to compute.

Earley doesn't perform any significant precomputation. Its entire parsing process is completely dynamic, so there is really nothing to save. Even if I make the function "work", it will not store anything useful, and loading will be just as fast as creating a new Lark instance.
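Given that, a workaround is simply to persist the grammar text and constructor options and rebuild the parser on load; a sketch (function names are mine, and `factory` stands in for lark.Lark):

```python
import pickle

def save_earley(grammar_text, options, f):
    # nothing is precomputed for Earley: the grammar text plus the
    # constructor options fully determine the parser
    pickle.dump({'grammar': grammar_text, 'options': options}, f)

def load_earley(f, factory):
    # `factory` would be lark.Lark in real use
    state = pickle.load(f)
    return factory(state['grammar'], **state['options'])
```

Loading this way costs the same as building a fresh instance, which is the point of the comment above: for Earley there is no cheaper state to restore.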

erezsh commented 3 years ago

@MegaIng

I also can't edit the docs.

That's just not true. You can submit a PR, just like you often do with code.

MegaIng commented 3 years ago

@erezsh Yes, that is true. But I didn't want to create a PR for a minimal notice.

erezsh commented 3 years ago

Probably the right thing to do here is not to change the docs, but to override XEarley.serialize to produce a nicer exception telling the user they are doing something wrong.
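That suggestion could look roughly like the following sketch (XEarley's real base classes and method signature may differ):

```python
class XEarley:
    # sketch only: override the serialization hook so the unsupported
    # operation fails with a clear message instead of an obscure
    # AttributeError about __serialize_fields__
    def serialize(self, memo=None):
        raise NotImplementedError(
            "Lark.save() is only supported with parser='lalr'; "
            "the Earley parser has no precomputed state to save")
```

The error message doubles as documentation at the exact point where users hit the limitation.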

wysiwyng commented 3 years ago

I would prefer if the documentation (where I learned about save in the first place) told me right away that it is not available for the Earley parser. The cache option, which as far as I can see uses this save feature under the hood, also has a short notice in the docs telling users that it's for LALR only.

zxteloiv commented 3 years ago

Hello, I've run into a related use of XEarley parsers on this issue. Because of the time cost, I've resorted to multiprocessing to parse the data, a lot of deep SQL queries. In practice, each SQL query may take 2~7 seconds to parse due to the large grammar (like the one borrowed from the MySQL Workbench repo).

The problems are as follows:

  1. A Lark instance fails to get pickled and sent to other processes (exceeding the pickle recursion limit, perhaps due to some attribute like self.parser = self), so serialization is crucial.
  2. I tried to manually wrap Lark with something like the following.

    class ParserWrapper:
        def __init__(self, parser):
            self.parser: lark.Lark = parser

        def __getstate__(self):
            data, m = self.parser.memo_serialize([lark.lexer.TerminalDef, lark.grammar.Rule])
            return {'data': data, 'memo': m}

        def __setstate__(self, state):
            self.parser = lark.Lark.load(state)

    The workaround still fails because the XEarley parser is not supported in this way, leading to an unhelpful "not subscriptable" error:

    import lark
    p = lark.Lark("""start: "a"*""")
    p.save(open('/tmp/x', 'wb'))     # successful
    pp = lark.Lark.load(open('/tmp/x', 'rb'))  # TypeError: 'Parser' object is not subscriptable
  3. For some large grammar files, like MySQL.lark.gz, loading is also time-consuming. Creating a new Lark instance in each subprocess may not be a good solution.

So in summary, the serialization of XEarley parsers, at least for the parsed rules and terminal defs, is a desirable and consistent feature.

Thanks for reading.

MegaIng commented 3 years ago

@zxteloiv I don't think we considered gigantic autogenerated grammars when making these decisions xD. Yeah, we should make this work for Earley then. (Also, I don't think there is a good reason why this shouldn't be picklable; I just think your grammar is so big that the default implementation fails.)

zxteloiv commented 3 years ago

Oh yes! :) I also found working with the large grammar cumbersome. For now I will change my code to reuse the parser in each worker process. Really appreciate your work!
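Reusing the parser per worker process can be sketched with a pool initializer; here make_parser is a stand-in for the expensive construction (e.g. lark.Lark(grammar_text, parser='earley')):

```python
import multiprocessing as mp

_parser = None

def make_parser(grammar_text):
    # stand-in for the expensive parser construction; in real use this
    # would be lark.Lark(grammar_text, parser='earley')
    return lambda text: (grammar_text, len(text))

def _init_worker(grammar_text):
    # build the parser once per worker process instead of pickling a
    # Lark instance across process boundaries
    global _parser
    _parser = make_parser(grammar_text)

def _parse_one(text):
    return _parser(text)

def parse_all(grammar_text, texts, workers=2):
    with mp.Pool(workers, initializer=_init_worker, initargs=(grammar_text,)) as pool:
        return pool.map(_parse_one, texts)
```

Only the grammar text crosses the process boundary, so the pickling problems from point 1 never arise.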

MegaIng commented 3 years ago

@erezsh We did have a regression in #781 (saving an Earley parser now fails on loading, not on saving).

erezsh commented 3 years ago

Hi @zxteloiv,

  1. Pickle supports recursion, so I don't think that's the case. It's more likely that the grammar is just huge. You can manually set the recursion limit, and perhaps that's the easiest solution.

  2. That's a problem. We should throw a user-friendly error in this case.
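The recursion-limit suggestion from point 1 can be sketched as follows (helper name is mine):

```python
import pickle
import sys

def pickle_deep(obj, limit=5000):
    # temporarily raise the interpreter recursion limit so that pickling
    # a deeply nested object graph doesn't hit RecursionError
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        return pickle.dumps(obj)
    finally:
        sys.setrecursionlimit(old)

# a structure nested too deeply for the default limit of 1000
node = None
for i in range(2000):
    node = (i, node)
blob = pickle_deep(node)
```

Restoring the old limit in the finally block keeps the raised limit from leaking into unrelated code.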

the serialization of XEarley parsers ... is a desirable and consistent feature.

No argument there. Someone just has to implement it.

@MegaIng Not sure what that means. I don't think save/load on the Earley parser ever worked.