ianh / owl

A parser generator for visibly pushdown languages.
MIT License
746 stars 21 forks source link

Identical "bracket" symbols not supported? #43

Closed modulovalue closed 1 year ago

modulovalue commented 1 year ago

Consider the following grammar:

sl = [ "'" "a"* "'" ]

which is meant to model a single quote single line string ('', 'aaa', ...).

My mental model of how it would work with VPLs is the following:

sl = [ "'" "a"* "'" ]
^      ^        |  |     
|----- |        |  |
|      |        |  |
'-> state 0     |  '-> state 0 again
       |        |
       |        |
       |        '-> we see a "'", we pop to state 0
       |
       '-> state 1 (any "'" will pop to state 0)

- we start in state 0.
- we switch to state 1 on an "'".
- we do our regular parsing in state 1.
- we see a "'" in state 1, so we pop state 1 and we go back to state 0.

The goal here is to support recursive definitions like the following and, as expected, the following works:

sl = [ "<" ("a" | [ "{" sl "}" ] )* ">" ]

However, if the "bracket symbols" are the same, it doesn't work anymore:

sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]

error: token ''' can't be used as both a start and an end 
keyword

  sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]
         ~~~                          ~~~  

I'm wondering, is my mental model of VPLs wrong, and having identical "start" and "end" symbols is simply not possible, or is this a limitation of owl and not a limitation of VPLs?

ianh commented 1 year ago

VPAs have disjoint sets of call and return symbols, so in order to be a VPL, the symbols of a language have to be able to be partitioned into call/return/internal symbols. Here's the relevant part from http://madhu.cs.illinois.edu/stoc04.pdf:

A pushdown alphabet is a tuple ~Σ = ⟨Σc,Σr,Σint⟩ that comprises three disjoint finite alphabets—Σc is a finite set of calls, Σr is a finite set of returns and Σint is a finite set of internal actions. For any such ~Σ, let Σ = Σc ∪ Σr ∪ Σint . We define pushdown automata over ~Σ. Intuitively, the pushdown automaton is restricted such that it pushes onto the stack only when it reads a call, it pops the stack only at returns, and does not use the stack when it reads internal actions. The input hence controls the kind of operations permissible on the stack—however, there is no restriction on the symbols that can be pushed or popped.

modulovalue commented 1 year ago

Bad news :(

So my hypothesis that the lexical structure of Dart can be expressed as a VPL (according to Rajeev Alur & P. Madhusudan) is wrong.

Thank you for your help @ianh.

ianh commented 1 year ago

No problem, thank you for your interest!

ianh commented 1 year ago

Though I should also say, strings are typically parsed using regular expressions -- there's no requirement that the start and the end of the string be call/return symbols in a VPA. It may still be true that the lexical structure of Dart can be written as a VPL.

modulovalue commented 1 year ago

@ianh strings in Dart can contain other strings via interpolated expressions, e.g.:

final a = "a${"b${"c..."}"}";

Which I don't think is regular, but context free (or maybe something between VPLs and CFLs...)

ianh commented 1 year ago

Ah, that makes sense. In this case the { and } could work as call/return symbols (which they probably would be anyway due to blocks of statements and so on), but strings like "{" would still be a problem due to the unmatched call.