melt-umn / copper

An integrated context-aware scanner and parser generator
http://melt.cs.umn.edu/copper
GNU Lesser General Public License v3.0
17 stars 4 forks source link

Multiple starting nonterminals in parser? #23

Open tedinski opened 6 years ago

tedinski commented 6 years ago

How difficult would this be to support on Copper's side?

I have in mind amending Copper specs to support something like:

    <StartSymbol>
      <NonterminalRef id="silver_definition_core_Root" grammar="host" />
    </StartSymbol>
    <AlternativeStarts>
      <AlternativeStart id="parseExpr">
        <NonterminalRef id="silver_definition_core_Expr" grammar="host" />
      </AlternativeStart>
      ...
    </AlternativeStarts>

The result being generating additional methods like parseExpr in the generated parser class, beside the normally generated one for the normal start symbol.

On the Silver side, to generate a spec like this, we could allow a syntax like:

parser parseC :: Root {
  grammar:a;
  grammar:b;
  parser parseExpr :: Expr;
  parser parseDecl :: Decl;
}

And the avoid re-compiling the same grammar over and over to get these extra parsers.

...Would this impact anything like the modular determinism analysis? (My guess is not really, because it's just another state (or few) in the original parser, which are just analyzed normally...)

schwerdf commented 6 years ago

This could be supported fairly easily in, say, an LL(1) parser, but with LALR(1) parsers it is much more difficult. The main problem is that LALR(1) DFAs are very monolithic and altering a single DFA state can cause a butterfly effect. So it might be possible, with a bit of juggling of the lookahead, to parse a sub-grammar rooted at some "alternative" start symbol using the original parser's DFA, but to verify this, an implementation would potentially have to compile the whole parser separately with each alternative start symbol, negating any computational advantage.

krame505 commented 6 years ago

Would it at least be possible to have a way of suppressing the 'useless nonterminal' warnings that show up in situations like this? This is kind of annoying, and currently the only workaround would be significant grammar refactoring.

schwerdf commented 6 years ago

Maybe I am misunderstanding the problem. I thought these alternative start symbols were nonterminals that were actually being used in the "original" grammar. How much overlap is there between the grammars -- are they completely separate, or sharing only terminals, or sharing terminals and nonterminals?

krame505 commented 6 years ago

Yes - in ableC we build several parsers on various nonterminals (Expr_c, Decl_c, etc.) in the concrete syntax as part of a helper library for constructing ASTs. Since all these nonterminals are in the same grammar, the 'lower' parsers (e.g. on Expr_c) see other nonterminals that can't occur as part of an Expr_c, raising useless nonterminal warnings. Currently the only way of getting rid of these warnings would be to move Expr_c and everything under it into a seperate grammar, which we don't want to do.
It would also be nice to somehow speed up the compilation process by not compiling a totally seperate parser in every case, but as you mentioned this seems hard to do.

ericvanwyk commented 6 years ago

In a long-dead version of this, we didn't add new start nonterminals, but added new productions for the start nonterminal:

TranslationUnit :: = 'now-only-parse-Expr_c-fake-terminal' Expr_c | 'now-only-parse-Stmt-c_fake_terminal' Stmt_c ...

What about this instead?

schwerdf commented 6 years ago

Purely from a structural perspective, I think the current approach seems like the best -- the goal is a separate parser for each start symbol, that is exactly what gets generated, and there is no efficiency factor to justify doing it another way.

As for the useless nonterminal warnings, try setting the isWarnUselessNTs flag to false.

tedinski commented 6 years ago

The main problem is that LALR(1) DFAs are very monolithic and altering a single DFA state can cause a butterfly effect. So it might be possible, with a bit of juggling of the lookahead, to parse a sub-grammar rooted at some "alternative" start symbol using the original parser's DFA, but to verify this, an implementation would potentially have to compile the whole parser separately with each alternative start symbol, negating any computational advantage.

Hmm. I'm not quite sure I understand this. I can easily see why having multiple starts would result in a different, probably larger DFA, and might result in new errors/conflicts. But that would be acceptable to anyone who has opted-in to a specific set of multiple starts, wouldn't it?

So I'm uncertain why instead of starting DFA construction with one item set, you couldn't just start DFA construction with e.g. 3 item sets, in my initial example with 2 alternative starts. Is it simply that this would re-use nothing and end up with 3 totally unconnected subgraphs in the DFA?

(If so, I'm not certain why that would necessarily be the case... It seems to me that the state for an item like Term . Expr [lookahead ';'] could coalesce with . Expr [lookahead $], no? Actually, maybe that's my trouble, I might be thinking about this too naively. LALR doesn't actually recognize this kind of redundancy and coalesce these states. But... maybe the consequent states would differ only in lookahead and be shared?)

schwerdf commented 6 years ago

The problem would be when the process does not result in three disjoint DFAs, but a single DFA with three start states.

Having more than one start state breaks some assumptions underlying the construction of an LALR(1) DFA (e.g., the relation between follow and lookahead sets), so could potentially introduce not only new compile-time errors, but erroneous runtime behavior. So we would either need to come up with a proof of correctness for DFAs with more than one start state, or verify their correctness by comparing them with separately compiled DFAs.

tedinski commented 6 years ago

Hi August, I was thinking about this again recently, and I wanted to double check something:

Having more than one start state breaks some assumptions underlying the construction of an LALR(1) DFA (e.g., the relation between follow and lookahead sets)

Is this solely about the LALR(1) method, or would it apply to LR(1) in general? i.e. If I used naive LR(1), would it be fine to have multiple start states then? I believe it was the naive algorithm I had in mind when I thought it wouldn't be an issue originally (a year ago)...

And as a possible follow up question, I assume Copper doesn't have a naive LR(1) parser builder already. How hard do you think it might be for me to add one? e.g. Are there any pervasive assumptions in key data structures that we're doing things a very particular way?

(And if I wrote one, would you merge it? Granted we wouldn't use it for anything for the time being, but I have some possibly dumb ideas for experiments to try in the future... I have this lingering idea about trying to generate code that looks pretty enough to be a hand-written parser.)

schwerdf commented 6 years ago

I think you are right that LR(1) would not have the same lookahead issues as LALR(1) in this application.

Implementation-wise, Copper was set up to support this kind of experimentation and you could certainly implement a new pipeline for building LR(1) DFAs, although the amount of existing code that could be re-used might be limited -- LALR(1) DFAs are built by first constructing an LR(0) DFA and then annotating with lookahead, so the current data structures detach the lookahead from the item sets, which may prove a hindrance when building an LR(1) DFA.

There is also the caveat that unless you are using very small grammars, the size and compilation time of the corresponding LR(1) parsers would quickly become unmanageable, as I discovered 12 years ago when I was still experimenting with GLR algorithms and tried to make Copper "algorithm-agnostic."