igordejanovic / parglare

A pure Python LR/GLR parser - http://www.igordejanovic.net/parglare/
MIT License
136 stars 32 forks source link

How far out is Table Caching? #52

Closed rexroni closed 6 years ago

rexroni commented 6 years ago

In the documentation there is mention that it will be possible to not generate the parser on every startup... how far out is that feature?

Even though I had read that in the documentation before I started, I had naively assumed it would still be somehow possible to pickle the state of the parser and reload it on a future run. In fact, it is possible to serialize the parser using the dill module but it doesn't work quite right after I reload it.

igordejanovic commented 6 years ago

It is on my TODO list but that list is huge. I can't tell when I'll start working on that. Any help is greatly appreciated. :)

The idea is to dump the grammar and LR table in some portable format (JSON, Protocol buffers etc.). That would make it possible to share parsers between different parglare implementations (I have plan to work on ports to some other languages).

I haven't played around with pickling yet but I guess with some pickling customization it should be doable. You don't have to pickle the whole parser object, just the tables, as the time you wait for the parser to be constructed is spent in table calculation. If you do make some progress using pickling please let us know.

rexroni commented 6 years ago

I might take a look at it... Could you point me to where in the code the tables are completed?

igordejanovic commented 6 years ago

Sure. This function creates LR table which is an instance of LRTable class. It is called here during Parser construction.

rexroni commented 6 years ago

Ok, I'm not going to pretend I understood all of the internals I was looking at, but it seems like the State objects have the actions built in, where the actions are brought into the create_tables class via the grammar parameter. Is that correct? Or in other words, I see a flow like this:

define grammar and actions -> create parse table, assigning actions to states as you go

But the caching problem would be a very neat and easy problem if it were possible to implement a flow more like this:

define grammar -> create parse table -> define actions -> assign actions to states

So my question for you, which I was not easily able to answer from reading the code, is this: Are you assigning actions to each State so early just for convenience, or is there some property of the actions that is being used to alter how the tables are created? What I mean is, does the output of the parse table change if the actions are changed, or could the actions be defined after creating the parse table and assigned to the State objects in a later step?

(On a side note, I also looked at implementing my language in a couple of other tools, but I really like the flexibility that parglare offers, due to it being scannerless :)

igordejanovic commented 6 years ago

There is an unfortunate overlap of the concept action in parglare. From the user point actions are defined as a part of grammar/language and represent a piece of code called to process recognized part of the input. OTOH in the context of LR tables action is LR action (either SHIFT or REDUCE, or ACCEPT at the end). I guess I should rename Action class in tables.py to LRAction for consistency sake.

From the point of view of table creation algorithm grammar actions are completely irrelevant.

I wrote this table creation code a while ago so I would need some time to recover my understanding. :) The good news is that I plan to do some nice improvements regarding error reporting that would need to touch table creation process so I should get to shape with the code while working on that.

As you can see in the code, LRState keeps actions and goto dictionary. Those are keyed by grammar symbols, terminals for actions and non-terminals for gotos. More or less that is everything parser needs to know to be able to parse. It would start in a predefined state, and based on lookahead execute certain action form the current state actions dictionary. goto dict is consulted when reduction occurs for the parser to know which should be the next state. If you haven't already I suggest to read relevant parts of the Dragon Book. Also, take a look at parglare LR automata visualization where you can see each LR state as a node, with all information about LR items and follow sets. Links represent actions and gotos. Also check out the parsing loop itself and what the parser uses.

What should be taken care of is to reconstruct the whole parser as it was before you must also reconstruct recognizers and user actions as they are inseparable part of the parsing process. Since those are python functions maybe you were having problem with that during pickling/unpickling.

rexroni commented 6 years ago

Alright, I'll try to take a deeper look at this, but its not likely to happen for at least a month. I need parglare to upgrade one of the build tools for my company's codebase, but not as badly as some other things are needed...

igordejanovic commented 6 years ago

@rexroni Sure, take your time. I expect to work on the table generation during next weeks as a part of better error reporting. I'll keep an eye on table caching issue along the way.

igordejanovic commented 6 years ago

Related to #36

igordejanovic commented 6 years ago

First version of table caching is on the master branch. Please see the updates in the HISTORY.md and the docs. Let me know it you have any trouble.

rexroni commented 6 years ago

That's so awesome! Thank you for implementing that feature! I can now run it two times in a row, and grammar.pgt gets generated and read like its supposed to. This is very exciting.

Oddly though, I am noticing different parsing result the first time I run my code (without grammar.pgt) and the second time I run my code (with grammar.pgt). I am open to the possibility that I am a total noob at writing grammars, and it could be my fault... but it seems like they should produce the same results regardless, right? I will attach a sample program demonstrating the issue:

This is the python file example.py:

from parglare import Parser, Grammar
g = Grammar.from_file('grammar.pg')
p = Parser(g)
print(p.parse('dynamic("OS")'))

This is grammar.pg:

PROG: EXPR+ EOF;

///// SIMPLE EXPRESSIONS /////

EXPR: BUILTIN '(' EXPR* ')' {800}
    | 'if' '(' EXPR_PAIR+ EXPR ')' {800}
    | 'switch' '(' EXPR EXPR_PAIR+ EXPR ')' {800}
    | '[' EXPR 'for' KVP+ ']'
    | OOO_PAREN EXPR ')'
    | EXPR F_PAREN EXPR+ KVP* ')' {left, 700}
    | EXPR DOT POST_DOT {left, 700}
    | EXPR '%' EXPR {left, 600}
    | EXPR '^' EXPR {left, 600}
    | EXPR '+' EXPR {left, 500}
    | EXPR '/' EXPR {left, 500}
    | EXPR '\\' EXPR {left, 500}
    // logical operators on string expressions
    // lower precedence than string-string operators,
    // higher than pure logic operators
    | EXPR '==' EXPR {left, 450}
    | EXPR '=~' EXPR {left, 450}
    // logical operators, which can't really be chained with string operators
    // (even if a nonzero-length string can evaluate to "true"):
    // !a.b.c+e.f makes no sense as !(a.b.c)+e.f -> (false)+e.f
    // so they will have lower precedence
    | '!' EXPR {right, 400}
    | EXPR '&&' EXPR {left, 300}
    | EXPR '||' EXPR {left, 250}
    | LITERAL {200}
    | VAR {100};

POST_DOT: MODIFIER '(' EXPR* ')' {1000}
        | VAR {100};

////// COMPOUND EXPRESSIONS //////

KVP: VAR '=' EXPR;
EXPR_PAIR: EXPR ':' EXPR;

///// MODIFIERS /////

MODIFIER: 'strip' | 'lower' | 'pre' | 'post' | 'regex' | 'wrap';

///// BUILTINS /////

BUILTIN: 'chmod' | 'dynamic' | 'table' | 'sha256' | 'cat';

///// LITERALS /////

LITERAL: FUNC | DICT | LIST | STRING | INT | BOOL | PUKE | NULL;
FUNC: '{' VAR+ KVP* '->' EXPR '}';
DICT: '<' KVP* '>';
BOOL: 'true' | 'false';
STRING: DQSTRING | SQSTRING;
LIST: '[' LIST_ELEM* ']';
LIST_ELEM: EXPR | EXPAND | SKIP;
EXPAND: '*' EXPR;

///// TERMINALS /////

terminals
// F_PAREN is a paren that must be a function call
F_PAREN: /(?<=[a-zA-Z_0-9)}])\(/;
// OOO_PAREN is a parent that must not be a function call
OOO_PAREN: /(?<![a-zA-Z_0-9)}])\(/;
VAR: /[a-zA-Z_][a-zA-Z_0-9]*/;
INT: /\d+/;
DOT: '.';
PUKE: 'puke';
NULL: 'null';
SKIP: 'skip';
DQSTRING: /"[^"\\]*(?:\\.[^"\\]*)*"/;
SQSTRING: /'[^'\\]*(?:\\.[^'\\]*)*'/;

If I run rm grammar.pgt && python example.py I get the following output

[[['dynamic', '(', ['"OS"'], ')']], None]

If I then run python example.py again I get:

[[['dynamic', '(', ['"OS"'], [], ')']], None]

The first time through it is correctly evaluating as the first line of the EXPR block, designed to catch builtin functions in my language like "dynamic":

BUILTIN '(' EXPR* ')' {800}

But the second time it is evaluating as the sixth line, designed to catch user-defined functions:

EXPR F_PAREN EXPR+ KVP* ')' {left, 700}

(Note that F_PAREN means "function paren" and indicates a "(" which is not proceeded by a space, which allows for lists of expressions that can be space separated instead of comma separated).

Is my grammar written in a way that it only happens to work without table caching, or is this a bug?

igordejanovic commented 6 years ago

Thanks for taking time to write a detailed report. There should be no difference in parsing between loaded and calculated table. There also should be no grammar that can't work with table caching so this must be a bug. I'll investigate. What is your Python version?

igordejanovic commented 6 years ago

The bug should be fixed now. Here is a regression test based on the code you provided.

rexroni commented 6 years ago

Awesome, it seems to be working great now! Glad I could help.