ianh / owl

A parser generator for visibly pushdown languages.
MIT License
746 stars 21 forks source link

Is the disjointness requirement for call and exit symbols really necessary? #44

Open modulovalue opened 1 year ago

modulovalue commented 1 year ago

According to https://github.com/ianh/owl/issues/43#issuecomment-1814917290, call and return symbols need to be disjoint.

Therefore, https://github.com/ianh/owl/issues/38 is not possible because string literals viewed as a VPL would have an identical call and return symbol, and would not be a VPL.

However, I'm wondering, could it be that this requirement by Rajeev Alur & P. Madhusudan for VPLs is too restrictive?

I don't have an intuitive understanding of the inner workings of owl, and this might be obvious, but @ianh can you think of a counterexample where having identical call and return symbols would break any of the invariants that owl depends on?

What would happen (where would owl break) if owl ignored this requirement, and a grammar like [ '[' ... '[' ] would be accepted and compiled by owl?

e.g.?:

Open Ast

ianh commented 1 year ago

The basic thing you need is, for each input token, to know whether it will cause the automaton to perform a call transition, a return transition, or an internal transition. It's technically possible, given owl's implementation, to have the same token used to perform different types of transition, as long as you can distinguish the different types of transition deterministically without any lookahead.

In this example, it wouldn't be possible to do that -- [[ could perform a call and a return or two calls.

And in any case, supporting stuff like this makes it much more difficult to explain what a valid grammar looks like. It would be similar to "shift-reduce conflicts" in LALR parser generators, where the error has nothing to do with the grammar itself but with some conflict within the automaton construction process.

modulovalue commented 1 year ago

In this example, it wouldn't be possible to do that -- [[ could perform a call and a return or two calls.

So, if I understood you correctly, what about the following grammar, which is rejected by owl:

a = [ '"' (b (',' b)*)? '"' ]
b = [ '{' (a (',' a)*)? '}' ]

An '"' after the initial '"' in a should unambiguously mean that it is a return symbol, so owl might be able to support it, or am I missing something?


Or more generally, consider [ X Y Z ], if the firstset of Y doesn't contain X, wouldn't it then be possible to support X == Z because a Z after an X would always be a return symbol?

ianh commented 1 year ago

It would technically be possible, but I'm not interested in doing it in owl -- I don't want to write error messages that involve terms like "firstset". One of my main motivations for writing owl was frustration with LALR-type shift-reduce errors that make me have to understand how the parser generator actually works internally to diagnose.

modulovalue commented 1 year ago

I completely agree that LALR conflicts are hard to debug.

However, the thing about firstsets is that they are actually relatively easy to understand and explain. One only needs to be able to walk subtrees from left to right to be able to reason about firstsets.

Furthermore, most implementations use a worklist-based approach for calculating them, but if one collects constraints and treats the problem of finding firstsets as a dataflow problem, it becomes possible to precisely output why the error exists in the first place.

Error messages like the following become possible by walking the dataflow graph in reverse:

X and Z can't be identical because Y can start with X:

1. Y can start with X because of:

[ X Y Z ]
    |
    v
    A B C
    |
    v
    G? H I
    |
    v
    X

2. Y can start with X because of:

[ X Y Z ]
    |
    v
    A B C
    |
    v
    G? H I
       |
       v
       X

3. Y can start with X because of:
   ...

The specification does not model a visibly pushdown language.

I think it would be really cool to see owl become more expressive like that and be able to support identical call and return symbols in cases where that is theoretically possible.