ianh / owl

A parser generator for visibly pushdown languages.
MIT License

How would you handle optional trailing semicolons? #4

Closed mqudsi closed 5 months ago

mqudsi commented 6 years ago

This is a really nice project. I am considering porting a shell scripting language to a formal owl definition and an owl-based parser, as I really like the design of this project.

However, one blocker for us is that in scripting languages with an optional terminating semicolon, a line break may be used instead of an explicit ;. The hard-coded interpretation of \n as a mere token separator is already a blocking issue here. Even if it weren't, I suspect handling this would require a separate class of tokens sitting somewhere between whitespace and literal tokens, if I'm not mistaken?

In particular, I am looking to parse syntax like

for f in foo; bar; baz; end

as being equal to

for f in foo
    bar # or bar;
    baz # or baz;
end

This example can be handled by replacing all line breaks with semicolons prior to parsing (feasibility/overhead aside), but that doesn't always work. For example:

echo hello |
    cat

Some symbols are allowed to wrap onto the next line without escaping the newline with a backslash, but the same input wouldn't be accepted if it were written as echo hello |; cat, since that's two separate statements.

Any ideas?

ianh commented 6 years ago

Thanks for trying owl, and for the clear description of this problem! I'm planning to add support for custom tokens soon (see #2). To support this use case, that ought to include enabling/disabling/customizing whitespace. Then you could parse newlines into their own kind of token and write (newline | ';') whenever you want a line separator.

The tricky part at that point is ignoring them in certain contexts. One way to do it would be by manually adding newline* wherever they can appear:

expr = newline* identifier (newline* '|' newline* identifier)*

…but that's inconvenient, and it would be easy to accidentally introduce ambiguities.

Let me think about this some more. I think a feature like

expr = ident ('|' ident)* .ignore newline

which automatically adds the newline* wherever it needs to go wouldn't be too hard to add.
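
To make the two ideas above concrete (a shared line-separator token, and newlines that are ignored in certain contexts), here is a minimal Python sketch of the tokenizing side. It is not owl and does not use any owl API; the function name tokenize, the token name SEP, and the rule "a newline right after | is a continuation" are assumptions made for illustration, based on the examples earlier in this thread.

```python
import re

# Hypothetical token pattern: a newline, ';', '|', or a run of other
# non-space characters. Spaces and tabs are skipped by finditer.
TOKEN = re.compile(r"\n|;|\||[^\s;|]+")

def tokenize(src):
    """Return a token list where ';' and significant newlines become SEP.

    Assumption: a newline directly after '|' is a line continuation
    and is dropped instead of being emitted as a separator.
    """
    out = []
    for match in TOKEN.finditer(src):
        tok = match.group()
        if tok == "\n":
            if out and out[-1] == "|":
                continue  # newline after a pipe: not a separator
            tok = "SEP"
        elif tok == ";":
            tok = "SEP"
        out.append(tok)
    return out

one_line = tokenize("for f in foo; bar; baz; end")
multi_line = tokenize("for f in foo\n    bar\n    baz\nend")
piped = tokenize("echo hello |\n    cat")
```

With this sketch, one_line and multi_line come out as the same token list, and piped contains no SEP between | and cat. Unlike the .ignore newline idea, it only drops newlines after the pipe, not before it.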

numist commented 3 years ago

I'm not sure it's so inconvenient as to be worth changing the grammar language. I arrived at the following independently, and it works fine while being easy enough to understand (it even allows empty statements!):

.whitespace ' ' '\t'

program = (';'|'\n')* stmt{(';'|'\n')+, 0+} (';'|'\n')*
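
As a rough illustration of what that rule accepts, the Python sketch below treats each statement as opaque text and splits on runs of ; or newline. It is only an approximation for reasoning about the separators, not how owl parses; the helper name split_statements is hypothetical, and it assumes statements themselves never contain ; or a newline.

```python
import re

def split_statements(src):
    """Split source text on runs of ';' or newline characters.

    Leading and trailing separator runs produce empty pieces, which
    are discarded, mirroring the optional (';'|'\n')* at both ends of
    the rule. Spaces and tabs around a statement are trimmed, since
    the grammar declares them as whitespace.
    """
    pieces = re.split(r"[;\n]+", src)
    return [p.strip(" \t") for p in pieces if p.strip(" \t")]

statements = split_statements(";\nfoo;;bar\n  baz;\n")
empty_program = split_statements("\n;\n")
```

Here statements holds the three statements foo, bar, and baz, and empty_program is an empty list, which corresponds to the 0+ in the rule.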

ianh commented 5 months ago

Closing this as there seems to be a reasonable solution.