mozilla-spidermonkey / jsparagus

Experimental JS parser-generator project.

Switch to a different parser type to support shift-actions #54

Closed: nbp closed this issue 4 years ago

nbp commented 4 years ago

Currently we are using a parser generator which generates an LR parser.

While discussing with @arai-a over IRC, and while implementing a potential fix for #48 based on adding code to the reduce computation, I realized that we do want to be able to execute code while shifting.

Having code executed while shifting can be done in an LR parser by adding empty rules (and removing the existing filter for empty rules), and adding production rules which are reduced at the place where the code needs to be executed. However, doing that for the Return flag ( #48 ) already causes shift-reduce conflicts. From what I understand, @arai-a faced a similar issue while trying to check for duplicated let/const bindings ( https://github.com/mozilla-spidermonkey/jsparagus/compare/master...arai-a:dup-early-error ), which was worked around by replacing the brackets or keywords with a non-terminal production where the code needs to be added.

The JavaScript language is designed with LL parsers in mind, so implementing shift-actions based on the context should be doable, apart from the arguments of lambdas, which have a dedicated cover grammar. In this case the context would be similar to adding grammar flags when providing top-down context, or left-to-right attributes as expressed in the ECMAScript spec ( #44 ).

Reduce-actions as present in an LR parser might be expressed as actions too, in a similar way as an LL parser unwinds its recursive descent to build the AST.

Note, shift-actions do not have to be executed at the time a shift occurs, only in the same order as the shifts occurred, unless the shift-state depends on the shift-actions (such as when checking grammar flags). This property can be useful for handling ambiguous shift-shift issues which have different actions. For example:

A ::= B . C D => do_b();
A ::= B C . D => do_c();

can be converted to:

A ::= B C . D => { do_b(); do_c(); }

if there is no shift-action for reducing C.
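
For illustration, here is a minimal sketch of what "execute later, in shift order" could look like at runtime, assuming a hypothetical queue of pending actions (none of these names exist in jsparagus today):

    // Hypothetical sketch: shift-actions are queued while the automaton is in an
    // ambiguous region, then run in shift order once the ambiguity is resolved.
    enum PendingAction {
        DoB,
        DoC,
    }

    #[derive(Default)]
    struct ActionQueue {
        pending: Vec<PendingAction>,
    }

    impl ActionQueue {
        // Called on a shift edge that carries an action.
        fn record(&mut self, action: PendingAction) {
            self.pending.push(action);
        }

        // Called once the parser reaches a state where the pending actions are no
        // longer ambiguous; they run in the same order as the shifts occurred.
        fn flush(&mut self) {
            for action in self.pending.drain(..) {
                match action {
                    PendingAction::DoB => do_b(),
                    PendingAction::DoC => do_c(),
                }
            }
        }
    }

    fn do_b() { /* shift-action attached to B */ }
    fn do_c() { /* shift-action attached to C */ }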

One thing to note is that we could write these rules as extensions to the existing grammar. Actions would be executed as if we were at the . location, and we could have multiple . per production. Each extension (file) could have its own Rust trait, which would simplify the way the grammar is handled compared to what we have today.

Today we have a single file which is used to express both the grammar and the reduce actions. Each reduce action is a call expression, but these call expressions are not well separated in terms of the ECMAScript specification. Thus each function being called has to handle various types of checks.

Conflicts such as the shift-reduce and reduce-reduce issues present in LR parsers no longer need to exist, because we can delay the execution of the actions. This implies that we could produce an ambiguous shift-reduce or reduce-reduce state, and the actions for the reduced cases would be executed later on, once the ambiguities no longer exist.

However, action placement would have to be done carefully to avoid the complexity of automatically handling left-recursion issues, such as:

A ::= . A `+` B => { count_pre_a(); }

While such action placement can be handled in a costly way:

A ::= RecA . => { for _ in rec_a.stack() { count_pre_a(); […] } }
RecA ::= RecA `+` B . => { rec_a.push([…]) }

I think we can try to avoid these without adding too much complexity.

How do we generate such shift-action tables? I think we have multiple ways. The naïve way would be to do a recursive descent over the grammar and generate the corresponding automaton, adding special cases for left recursion. A non-naïve way would be to take an LR parser generator and remove shift-reduce and reduce-reduce issues by adding context. Adding context is done by inlining these LR-ambiguities in the productions which use them.

Another advantage of having shift-actions is that we can easily implement the rejection of rules based on look-ahead sequences, as providing context is equivalent to inlining the automaton which matches the corresponding rule.

jorendorff commented 4 years ago

JavaScript language is designed with LL parsers in mind

Not really. It's designed with handcoded recursive-descent parsers in mind, and those and LL are both top-down parsers. But the JS grammar is not close to being LL(k), even after left-factoring. Probably every use of LeftHandSideExpression causes a FIRST/FIRST conflict.

Still, this is a terminological objection only.

One thing to note, is that we could write these rules as extensions to the existing grammar.

I like this idea.

[...] can be converted to: A ::= B C . D => { do_b(); do_c(); } if there is no shift-action for reducing C.

I believe the requirement is stronger: no production encountered while parsing C may have any actions. That seems very unlikely. The optimization opportunities are better if we can somehow know when pairs of actions commute.
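
To make "commute" concrete, a tiny invented example of state-touching actions (the struct and fields are made up for this illustration):

    // Two actions commute if running them in either order leaves the observable
    // parser state identical.
    struct EarlyErrorState {
        bound_names: Vec<String>,
        saw_return: bool,
    }

    // These two commute: they touch disjoint pieces of state.
    fn note_return(st: &mut EarlyErrorState) {
        st.saw_return = true;
    }
    fn note_binding(st: &mut EarlyErrorState, name: &str) {
        st.bound_names.push(name.to_string());
    }

    // Two calls to note_binding with different names do NOT commute if anything
    // later depends on the order of `bound_names`.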

Another requirement is that do_b() must not be allowed to access any internal parser state, such as the last token parsed or the current position. That's an attractive rule, but it might prove inconvenient.

Conflicts such as shift-reduce and reduce-reduce issues present in LR parsers no longer need to exist as we can delay the execution of the actions...

Hmm. This may be easier said than done. The LALR parser we have is pretty typical. It's a standard algorithm.

The kind of "set of ambiguous states" approach you're suggesting is definitely possible. In fact, I think most parsing algorithms can be seen in this light. The LR parser generator certainly works like this (see the comments at "LR parsers: Why?" in jsparagus/gen.py). But creating this kind of parser generator means making design decisions about how much runtime state to model, how much to inline, and where to stop. LR parsers stop at reduce actions and never inline anything. This simplifies the implementation. It's very unclear to me what those design decisions should be, if we decided to create a custom design.

jorendorff commented 4 years ago

I think I would be more comfortable with a top-down parser, so I'm not opposed to the idea of switching to a different parser type.

But it means going off the map of how much parser theory I know, and it seems quite risky.

nbp commented 4 years ago

Here is the content of an email which is slightly better than the initial comment:

As you might have read on GitHub [1], I think we need something slightly better than an LR parser to parse JavaScript. The problem is not directly related to LR parsers, but to where code can be attached in the parser. We can introduce fake reduce rules, however this either makes the grammar impossible to handle with an LR parser, or changes the grammar so much that we lose the ability to relate it to the grammar defined in the specification.

My proposal is the following. LR parsers are ill-defined in the sense that they split shift and reduce into two separate groups and lack the ability to reason about both at the same time. In an LR parser, when a reduce is found, the stack is popped N times to resume at a given state, and the reduced non-terminal is shifted (GOTO).
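
For reference, the reduce step being described is roughly the following, with an invented table layout:

    // Minimal sketch of an LR reduce: pop N entries, then take the GOTO
    // transition on the reduced non-terminal from the exposed state.
    fn reduce(state_stack: &mut Vec<usize>, goto_table: &[Vec<usize>], nonterminal: usize, n: usize) {
        // Pop the N symbols of the production's right-hand side.
        let new_len = state_stack.len() - n;
        state_stack.truncate(new_len);
        // Resume at the state now on top, and shift the reduced non-terminal (GOTO).
        let resumed = *state_stack.last().expect("LR stack never empties");
        state_stack.push(goto_table[resumed][nonterminal]);
    }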

In LR parsers, shift-shift conflicts are not an issue because we are building a state machine which ends with the reduce states. Replacing each reduce state with a node that has an empty transition performing some action (recording the number of elements N which have to be popped) would be equivalent.

However, as mentioned in [1], one of the advantages of having only shift operations (following automaton transitions) is that we are then capable of moving/delaying the actions to the non-ambiguous nodes of the graph. An equivalent modification in terms of grammar hacks would be to pull the tokens which follow a non-terminal into the productions of that non-terminal, until the grammar is no longer LR-ambiguous.

The advantage of doing so is that we would be able to express left-to-right grammar attributes instead of only bottom-up grammar attributes. Thus we can build a collection of names, flags for rejecting grammar productions, and lookahead rules with more than one token of lookahead, all as grammar annotations.

The major reason for wanting to express as much as possible as attribute grammars is that we could define these rules in a modular way (spread across multiple files) and in a maintainable way (grammar actions: code as comments).

One example implementation would be to express each grammar annotation file as a new trait, where the code extracted from the grammar is converted to the Rust code which implements the given trait. Note, while I am writing left-to-right attributes, I am actually thinking in terms of parser state which is being mutated, but where the order of evaluation is left-to-right, as the text is being processed.
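
As a sketch only, one such generated trait could look like the following; the names (EarlyErrors, on_lexical_binding, ...) are invented and are not existing jsparagus APIs:

    // Hypothetical shape of one generated trait per grammar-annotation file; the
    // code attached to the grammar would be emitted as the impl on the parser.
    trait EarlyErrors {
        // Code attached at a `.` position inside a lexical-declaration production.
        fn on_lexical_binding(&mut self, name: &str) -> Result<(), String>;
    }

    struct ParserState {
        declared_names: Vec<String>,
    }

    impl EarlyErrors for ParserState {
        fn on_lexical_binding(&mut self, name: &str) -> Result<(), String> {
            // For example, reject duplicate let/const bindings, as attempted in
            // the dup-early-error branch mentioned above.
            if self.declared_names.iter().any(|n| n == name) {
                return Err(format!("redeclaration of {}", name));
            }
            self.declared_names.push(name.to_string());
            Ok(())
        }
    }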

The advantage of seeing each grammar annotation as a trait is that we are free to re-order the grammar annotations, as long as they do not depend on the same trait. This is opposed to an LR parser, which forces the evaluation to be executed on reduce operations, thus causing LR-ambiguities and restrictions.

This approach that I am describing is a pragmatic one based on coding concepts. A formal treatment of valid left-to-right attributed grammars is given in the following classification [2].

[1] https://github.com/mozilla-spidermonkey/jsparagus/issues/54#issue-545111313
[2] https://www.researchgate.net/publication/226470263_The_hierarchy_of_LR-attributed_grammars

nbp commented 4 years ago

Here is the content of a pad used as examples to explain the previous thinking:

// Delayed reduce action. The reduce action is moved to after shifting the t_or_nt token, which is
// pushed on the stack and is sufficient to disambiguate the reduce action.
{
    let reduced_nt = <const>;
    let (_state, t_or_nt) = self.stack.pop();
    self.reduce_nt_function(); /* pop N, maybe build the AST */
    self.stack.push(self.next_action(self.stack.top(), &mut self));
    self.stack.push(self.next_action(t_or_nt, &mut self));
}
// Non-delayed reduce action, as done in an LR parser today:
{
    let reduced_nt = <const>;
    self.reduce_nt_function(); /* pop N, maybe build the AST */
    self.stack.push(self.next_action(self.stack.top(), &mut self));
}

In terms of grammar, the following grammar has LALR-conflicts:

    A ::
        `b` `c` { reduce_A($1, $2) }

    C ::
        `b` `c` { reduce_C($1, $2) }

    B ::
        A `d` `f`
        C `e` `f`
    D ::
        C `d` `f`
        A `e` `f`

    Start ::
        B `b`
        D `d`

But it can be converted into one which does not have the same conflict, by pulling the terminals/non-terminals that follow the A non-terminal into A, and moving the reduce operation after these new "lookahead" tokens.

    A ::
        `b` `c` `d` { replay_token(1, || reduce_A($1, $2)); }
        `b` `c` `e` { replay_token(1, || reduce_A($1, $2)); }

    C ::
        `b` `c` `e` { replay_token(1, || reduce_C($1, $2)); }
        `b` `c` `d` { replay_token(1, || reduce_C($1, $2)); }

    B ::
        A `d` `f`
        C `e` `f`
    D ::
        C `d` `f`
        A `e` `f`

    Start ::
        B `b`
        D `d`

Adding a production `q` to A:

    A ::
        `b` `c` `d` { replay_token(1, || reduce_Abc($1, $2)); }
        `b` `c` `e` { replay_token(1, || reduce_Abc($1, $2)); }
        `q` { reduce_Aq($1) }

    C ::
        `b` `c` `e` { replay_token(1, || reduce_C($1, $2)); /* $1 and $2 are the C stack elements */ }
        `b` `c` `d` { replay_token(1, || reduce_C($1, $2)); /* $1 and $2 are the C stack elements */ }

    B ::
        A `d` `f`
        C `e` `f`
    D ::
        C `d` `f`
        A `e` `f`

    Start ::
        B `b`
        D `d`

Looking through the existing literature, I found multiple papers which explain how to increase the lookahead to resolve conflicts, as opposed to the solution proposed here, which modifies the reduce operation so that it alters the stack later on. This is conceptually equivalent as long as we are manipulating only terminals.

However, delaying reduce operations, as mentioned above, could also be done after non-terminals when there is no aliasing/dependency between the executed pieces of code (reduce operations). My blind guess is that this would give us the ability to reduce JS lambda parameters without the cover grammar: once we have shifted the disambiguating terminals/non-terminal, namely the lambda's `=>`, we are able to reduce all the parameter names in a cascading way by replaying terminals and non-terminals on top of the reduced stack.
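
As a rough sketch of the replay mechanism assumed here (all names are illustrative, and the delayed reduce is passed as a closure over the parser rather than as in the `replay_token` notation above):

    // Hypothetical: pop the entries that were shifted only to disambiguate,
    // run the delayed reduce on the shorter stack, then replay the popped
    // terminals/non-terminals as if they had just been shifted.
    struct StackEntry {
        state: usize,
        term_or_nonterm: u32,
    }

    struct ReplayParser {
        stack: Vec<StackEntry>,
    }

    impl ReplayParser {
        fn replay<F: FnOnce(&mut ReplayParser)>(&mut self, count: usize, delayed_reduce: F) {
            // Save the extra "lookahead" entries sitting on top of the stack.
            let replayed = self.stack.split_off(self.stack.len() - count);
            // Perform the delayed reduce action on the now-shorter stack.
            delayed_reduce(&mut *self);
            // Replay the saved entries on top of the reduced stack.
            for entry in replayed {
                self.shift(entry.term_or_nonterm);
            }
        }

        fn shift(&mut self, _term_or_nonterm: u32) {
            // Table-driven shift; omitted here.
        }
    }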

In terms of existing literature:

http://cssauh.com/xc/pub/LaneTable_APPLC12.pdf

Describes the Lane Table algorithm for building an LR(1-k) parser.

http://smallcultfollowing.com/babysteps/blog/2017/03/17/the-lane-table-algorithm/

This is a blog post attempting to implement the Lane Table algorithm mentioned above.

https://pdfs.semanticscholar.org/e450/eeebc5b37cdbf4d853a70955f7088984c8a5.pdf

Describes the edge-pushing algorithm based on lane-tracing (the Lane Table algorithm?).

http://jikes.sourceforge.net/documents/thesis.pdf

This other paper describes how we might do the same with an LALR(k) parser, without having much more state than LR(0). A simple statement summarizes the whole algorithm well:

A practical approach for constructing an LALR(k) parser is to first construct the LR(0) machine and then resolve conflicts in inconsistent states […] by computing the lookahead strings for final items in such states.

nbp commented 4 years ago

This issue is currently being worked on in a branch on my fork of jsparagus: https://github.com/mozilla-spidermonkey/jsparagus/compare/master...nbp:redo-parse-table