Construct a Lark grammar from ABNF format (RFC 5234)

lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

MIT License

4.87k stars 413 forks source link

Construct a Lark grammar from ABNF format (RFC 5234) #318

Open Zac-HD opened 5 years ago

Zac-HD commented 5 years ago

RFC 5234 describes the standard grammar format for internet standards, such as the notoriously-hard-to-validate email addresses.

Because this standard is a dialect of EBNF and does not allow for embedded code, it should be relatively easy to construct a Lark object for a given ABNF grammar - at least easier than converting from Nearley! Hopefully it's easy enough that runtime conversion in a new Lark.from_abnf method (or group of methods) would be practical.

This feature request is based on HypothesisWorks/hypothesis#170, where I eventually realized that parsing ABNF was going to be easier as well as more widely useful upstream. I'd be happy to work on this with some guidance about where to start, and have already translated the grammar of ABNF from ABNF to Lark's format.

erezsh commented 5 years ago

I think it's a nice idea. I think your best bet is to create a Lark parser that reproduces the output of the GrammarLoader parser : https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L663

You might have to do some post-processing (for example, in a Transformer) to make them a perfect fit.

Once you have that working, I'll add an interface to plug it in there when called with ABNF grammar.

While using Lark.from_abnf isn't bad (and can be complemented with Lark.open_abnf), how about doing this instead?

parser = Lark("... grammar ... ", syntax='abnf')

That opens the door to adding other formats in the future.

davaya commented 5 years ago

The output of GrammarLoader is an instance of the Grammar class with rule_defs, term_defs and ignore variables. Is there a "GrammarSave" function that takes a Grammar instance and produces a Lark file or returns a string?

It will be necessary to convert an ABNF file to a Lark file, because there are inevitably features of Lark that are not supported in ABNF. The workflow would be 1) convert ABNF to a Lark file, then 2) tweak the Lark file to achieve the desired results. The GrammarSave function is needed both to develop an ABNF loader in the first place and to do grammar optimization once it is available.

erezsh commented 5 years ago

@davaya

I don't understand what the difficulty is, and what you're trying to do to solve it.

convert an ABNF file to a Lark file

Why does it have to go through files?

there are inevitably features of Lark that are not supported in ABNF

So give Lark a default (like an empty list, or whatever if appropriate)

davaya commented 5 years ago

If there is not a two-way lossless conversion ABNF <-> Lark, then something is lost in translation. Some feature of Lark, e.g., tree shaping, simply cannot be expressed at all in ABNF. If a developer wishes to use that feature, then supporting ABNF in GrammarLoader is not a complete solution.

Instead, the developer will take ABNF as a starting point. Reading an ABNF in GrammarLoader, then saving the grammar to a Lark file, allows the developer edit that file to add features to the grammar. If the updated grammar is saved in ABNF format, those features will be lost.

Conformance to an ABNF specification is validated based solely on the ABNF. But there can be multiple implementations of an ABNF specification, some cleaner than others. Any feature that changes the AST to make it easier to use, but does not change the data on the wire, is a reason to be able to save an ABNF grammar in Lark format.

erezsh commented 5 years ago

simply cannot be expressed at all in ABNF

In that situation, it's common to add new language features that don't break the old one. For example a new operator, that works in lark and not ABNF, but you don't have to use it.

saving the grammar to a Lark file, allows the developer edit that file to add features to the grammar

That seems a bit cumbersome. Why not just a have an translator from Extended-ABNF into ABNF? It should be fairly easy, just removing and canonizing some nodes, and then writing it back. It's simple enough that Lark's reconstructor might even be able to handle it.