phorward / unicc

LALR parser generator targeting C, C++, Python, JavaScript, JSON and XML
MIT License

Multi mode parser #15

Closed mingodad closed 3 years ago

mingodad commented 5 years ago

Hello! Is it possible to have multi-mode parsers with unicc? For example, GMPL (https://en.wikibooks.org/wiki/GLPK/GMPL_%28MathProg%29) has two sections:

1- Model: where the declarations happen
2- Data: where data is assigned to declared entities

When we are parsing in the "Model section" statements like this are valid but not in the data section:

set I := {'Seattle', 'San-Diego'};
set I2;
set J; # markets
param b{j in J}; #demand at market j in cases

When we are parsing in the "Data section" statements like this are valid:

set J := New-York Chicago Topeka; #also valid in Model section
#set I3; #declaration is invalid in data section
set I2 := 'Seattle' 'San-Diego';

Generalizing, it's like having multiple parsers as sub-parsers and a way to switch between them. Cheers!

phorward commented 5 years ago

Hi @mingodad, thanks for your question. If model definitions and data definitions live in separate files, the best approach would be to implement a separate parser for each. UniCC cannot define multiple, separate grammars in one grammar definition, but this feature will become part of UniCCv2, which is under development.

If everything is in one file and the sections are themselves part of the grammar, section-dependent syntax must be handled in the semantic actions, as in this short example:

gmpl$: "data-section" '{'   [* pcb->section = DATA; *]
           defs '}'         [* pcb->section = NONE; *]
     | "model-section" '{'  [* pcb->section = MODEL; *]
           defs '}'         [* pcb->section = NONE; *]
     ;

defs: "set" ident ":=" '{' values '}' ';'
    | "set" ident ":=" values ';'  [* if( pcb->section == DATA )
                                          error("Direct values list: Not allowed in data section"); *]
    ;

values: values ',' value | value ;

value: ident   [* if( pcb->section == DATA )
                      error("ident: Not allowed in data section"); *]
     | string ;

This might be a solution, but not the best one: because LALR parsing works bottom-up, parts of the input such as the values lists are parsed first and only afterwards rejected as invalid for the current section.
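As a rough illustration of the kind of check the semantic actions above perform, here is a minimal C sketch of the section state and a validation helper. All names (`ParseState`, `SetForm`, `enter_section`, `check_set`) are hypothetical and not part of UniCC, and the acceptance rules are only one reading of the examples given in this thread, not the GMPL specification.

```c
#include <stddef.h>

/* Hypothetical section state, mirroring the pcb->section field
   used in the grammar example above. Not part of UniCC. */
typedef enum { SECTION_NONE, SECTION_MODEL, SECTION_DATA } Section;

typedef struct {
    Section     section; /* section currently being parsed */
    const char *error;   /* last semantic error, or NULL */
} ParseState;

/* Shape of a "set" statement as seen when its rule is reduced. */
typedef enum {
    SET_DECLARATION,   /* set I2;                         */
    SET_BRACED_VALUES, /* set I := {'Seattle', ...};      */
    SET_PLAIN_VALUES   /* set J := New-York Chicago ...;  */
} SetForm;

/* Switch the parser "mode" and clear any previous error. */
void enter_section(ParseState *st, Section s)
{
    st->section = s;
    st->error = NULL;
}

/* Return 1 if the statement form is accepted in the current
   section, otherwise return 0 and record an error message.
   Assumed rules: the model section accepts every form, the data
   section accepts only plain value lists. */
int check_set(ParseState *st, SetForm form)
{
    if (st->section == SECTION_NONE) {
        st->error = "set statement outside of any section";
        return 0;
    }
    if (st->section == SECTION_DATA && form != SET_PLAIN_VALUES) {
        st->error = (form == SET_DECLARATION)
            ? "declaration not allowed in data section"
            : "braced value list not allowed in data section";
        return 0;
    }
    st->error = NULL;
    return 1;
}
```

A semantic action would call something like `check_set` after the statement has already been reduced, which is exactly the limitation described above: the input is parsed first and rejected afterwards.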

mingodad commented 5 years ago

Thank you for the reply! I've been looking at unicc2 and trying to compile it, but I'm getting a segfault with:

./gram2c grammars/es6.bnf

And with valgrind:

Grammar has no goal!
Grammar has no goal!
parse.c, 1017: Function called with wrong or incomplete parameters, fix your call!
Parse error

It seems that it is not in a working state right now:

This function is only run internally.
Don't call it if you're unsure ;)... */
pboolean gram_prepare( Grammar* g )
{
...
    if( !g->goal ) ////!!!! Testing for goal before parsing it ?????
    {
        /* No such goal! */
        fprintf( stderr, "Grammar has no goal!\n" );
        RETURN( FALSE );
    }

I might be interested in collaborating on it. What's your view on https://github.com/tree-sitter/tree-sitter, and what advantages do you think unicc/unicc2 has?

Cheers !

phorward commented 5 years ago

I've been looking at unicc2 and trying to compile it, but I'm getting a segfault with:

./gram2c grammars/es6.bnf

...

Well, gram2c is just a helper program that unicc2 uses to generate parts of itself. unicc is the parser generator's main executable, and also a kind of "parser interpreter", because it can run a parser directly (compiling into a target language is not implemented yet). The grammar definition for ECMAScript 6 in es6.bnf is also incomplete and untested, which is why you get a segfault there. The other grammars fare better. As you can see, unicc2 is a very early development version and nowhere near production-ready.

I might be interested in collaborating on it. What's your view on https://github.com/tree-sitter/tree-sitter, and what advantages do you think unicc/unicc2 has?

Right now, I'm totally unsure where to go with unicc2. unicc2 is mostly an effort to make the existing unicc more modular and faster, and to provide a simpler way to write grammars. The current unicc in this repository is very stable and can be used for different purposes, but it includes many features that are obsolete and could be removed in the future. My current focus is also shifting toward creating a new programming language that focuses on parsing and borrows many semantics from awk.

COLM, for example, is such a compiler language, but it's not well documented and, in my opinion, too tightly coupled to C++. But I'm also experimenting with it; it implements a backtracking LR parser.

tree-sitter, on the other hand, is also a very nice library with a GLR-based parser. In the open-source spirit, it would be best to use it as an established base for further projects and languages of this kind, especially since it's implemented in C and Rust. I'm also of the opinion that Rust is the more future-oriented systems language and a replacement for C and C++, which is why I'm concentrating on it as well.

Unfortunately, the time I can spend on compiler- and parser-related topics is very limited, because I mostly do this in my spare time for fun. Therefore, I need to think carefully about future steps and which direction to take. I would really like to implement this "awk-like" language, with parsing as its key aspect, as a productive and useful tool for everyday use.

If you want to collaborate, what are your personal preferences, and which topics would you like to dive into?

mingodad commented 5 years ago

Hello Jan! Thanks again for the reply! I'm also drawn to programming languages. I've done some work on a fork of a scripting language (https://github.com/mingodad/squilu) and modified Lua/LuaJIT to have a syntax more like C/C++/JavaScript (https://github.com/mingodad/ljs). I would like to have a parser/lexer library included in a scripting language to do the heavy lifting, and use the scripting to drive it, maybe something like https://github.com/jeffreykegler/libmarpa or https://github.com/rurban/gazelle. https://github.com/SAP/chevrotain also has interesting graphical documentation generation. I'm always looking at projects like yours to see what I can learn and whether they might be a better fit for what I'm looking for, but so far https://github.com/tree-sitter/tree-sitter seems to have a lot of useful grammars already done. Cheers!

phorward commented 5 years ago

Hi @mingodad, thanks for the many links. Your work on these Lua-based languages is impressive, and I hope you'll find useful applications and a good user-base for it.

Indeed, tree-sitter and marpa are both highly versatile parsing systems. It is worth considering making one of these tools the engine of the language or library you have in mind. Since tree-sitter uses GLR as its underlying parsing algorithm, it looks much more convenient to me, because it follows well-known concepts. On the other hand, the incremental features of tree-sitter are not necessary for my requirements. Earley parsing is also a good choice if it's important to support any grammar, even context-dependent ones. But here you run into the gap between getting a parse forest with multiple possible parses and needing one distinct route through the grammar.

I'm quite unsure right now what would be the best option. Writing the parser system on your own gets you the ability to understand every single part of it, how to use it, and where to find its advantages and disadvantages.

mingodad commented 5 years ago

Thank you again for the reply! And I agree with you on:

"Writing the parser system on your own gets you the ability to understand every single part of it, how to use it, and where to find its advantages and disadvantages."

I keep circling around it but haven't managed to do it yet, and I appreciate what you did and your willingness to share your experience.

Cheers !

phorward commented 3 years ago

Will close this now. UniCC will be abandoned.