picoe / Eto.Parse

Recursive descent LL(k) parser for .NET with Fluent API, BNF, EBNF and Gold Grammars
MIT License

Antlr Grammar support #6

Open thargy opened 10 years ago

thargy commented 10 years ago

I've just been testing a Gold grammar for XML on a large XML file, and the performance of the resulting parser is absolutely terrible. Parsing the equivalent JSON file (we have the same data in XML and JSON format) is much faster (~80x). Perhaps this is because the Gold grammar is an LALR grammar being translated into an LL(*) grammar.

Do you have any plans to support Antlr-style grammars? Antlr is probably one of the most popular formats (alongside yacc/lex and bison/flex) and, most importantly, there are a huge number of published grammars for Antlr, which are optimised for LL(*) parsers (as that's what Antlr uses).

The syntax appears to be fairly straightforward.

Finally, if we decide to go ahead, I plan on adding an MSBuild target that will allow you to include grammars directly in the project and have them produce the cs code (using your CodeParserWriter). I can set it up so that it uses the Build Action on the file, which would make it possible to specify BNF/EBNF/Gold and have them auto-generate the equivalent cs file. Loading the grammar fluently has a big start-up performance benefit over loading the grammar files directly, so a compile-time generation step would be really useful.

I mention this as I'm more than happy to provide the (un)install.ps1 scripts for your NuGet package. The impact would be non-existent for users who don't set anything to build, but it would make it really easy for them to generate cs files rather than load grammars at runtime.

cwensley commented 10 years ago

Performance depends heavily on which parsers have been named. If you clear the names of the rules that you do not care about (e.g. those that are only part of a composite rule), performance will increase dramatically. There are also special-purpose parsers, such as StringParser, that improve performance a great deal.

To get the best performance out of Eto.Parse, writing the grammar in C# is certainly the best way. However, if you can link or send the Gold grammar that you are using, I can take a look at optimizing the Gold grammar parser to generate more efficient code, and/or at what tweaks can be made to make it faster.

Antlr-style grammars were not on the radar, as no one had asked for them until now, but they certainly are now and would be a great addition to Eto.Parse.

I would also welcome any contributions like an MSBuild target to generate cs code from various grammars.

thargy commented 10 years ago

Hi,

Adding an XML Grammar and an ANTLR grammar reader would really help make the project more attractive IMHO.

In return, I started adding the build task project and targets (you can see them in my fork at https://github.com/thargy/Eto.Parse). I got quite a way before I noticed that the BNF & EBNF grammars require a starting rule to generate code.

I can specify a build action to determine the grammar type (in fact I've made it really easy for you to add more via the .targets file) but it's hard to grab supplementary info for the grammar.

I need each grammar to implement a consistent ToCode() method, and I need to be able to grab the remaining information from the grammar files themselves, or infer it.

The two bits of info needed are the grammar name and the starting rule.

The name can easily be inferred from the filename, and I'm happy to take that route. Of course the generated class name will use this name.

The starting rule can default to the 'first rule', or 'grammar', or you could add a syntax for this in EBNF/BNF (Gold and Antlr already specify the start rule).
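To make the two inference rules above concrete, here is a minimal, language-agnostic sketch in Python. It is not Eto.Parse code: the helper names `infer_grammar_name` and `infer_start_rule` are hypothetical, the rule-head pattern only covers simple `<name> ::= ...` and `name = ...` definitions, and preferring a rule called 'grammar' over the first rule is just one assumed way of combining the two defaults mentioned.

```python
import re
from pathlib import Path

# Hypothetical helpers for illustration only; not part of Eto.Parse.

# Matches the head of a simple BNF/EBNF rule such as
#   <expr> ::= ...     or     expr = ... ;
RULE_HEAD = re.compile(r"\s*<?([A-Za-z][\w-]*)>?\s*(?:::=|:=|=)")


def infer_grammar_name(path):
    """Derive a class-style name from the grammar file name."""
    stem = Path(path).stem  # "xml-document.ebnf" -> "xml-document"
    words = [w for w in re.split(r"[^0-9A-Za-z]+", stem) if w]
    name = "".join(w[0].upper() + w[1:] for w in words)
    # Identifiers cannot be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


def infer_start_rule(grammar_text, preferred="grammar"):
    """Pick a start rule: a rule named `preferred` if present, else the first rule."""
    names = []
    for line in grammar_text.splitlines():
        match = RULE_HEAD.match(line)
        if match:
            names.append(match.group(1))
    if not names:
        return None  # no rules found; the caller must ask for an explicit start rule
    for name in names:
        if name.lower() == preferred:
            return name
    return names[0]
```

For example, `infer_grammar_name("Grammars/xml-document.ebnf")` gives `XmlDocument`, and a grammar whose first definition is `<expr> ::= ...` with no rule called `grammar` yields `expr` as the start rule. An explicit start-rule syntax in the BNF/EBNF files would take precedence over either default.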

I wondered what your thoughts were before I continue.