Discrepancy between terminal and single-match regex terminal

runtimeverification / k

K Framework Tools 7.0

BSD 3-Clause "New" or "Revised" License

430 stars 142 forks source link

Consider the following definitions.

1. With terminal

module LEXICAL
    syntax Foo ::= "foo"
    rule <k> foo => .K ... </k>
endmodule

This works as expected.

2. With regex terminal

Let's change the symbol to a regex terminal.

syntax Foo ::= r"foo"

This defines the same set of valid tokens for Foo as the previous definition. But now there's a parse error:

[Error] Inner Parser: Parse error: unexpected token '=>' following token 'foo'.
        Source(/home/ttoth/git/pyk/lexical.k)
        Location(3,18,3,20)
        3 |         rule <k> foo => .K ... </k>
          .                      ^~
[Error] Compiler: Had 1 parsing errors.

3. Fix potential lexical conflicts

Speculatively, let's change the token to be disjoint from other lexicals in the prelude.

module LEXICAL
    syntax Foo ::= r"@foo"
    rule <k> @foo => .K ... </k>
endmodule

Now the error is something different:

[Error] Compiler: Expected format attribute on production with regular
expression terminal. Did you forget the 'token' attribute?
        Source(/home/ttoth/git/pyk/lexical.k)
        Location(2,20,2,27)
        2 |         syntax Foo ::= r"@foo"
          .                        ^~~~~~~
[Error] Compiler: Had 1 structural errors.

4. Add the `format` attribute

After adding the format attribute, kompilation works.

syntax Foo ::= r"@foo" [format(@foo)]

Questions

This raises a few questions.

Why doesn't (2) work if (1) does?
Why does the kompiler ask for the format attribute for (3) if for (2) it does not?

I dug into this.

This defines the same set of valid tokens for Foo as the previous definition. But now there's a parse error:

I think you identified the reason here, but for concreteness the issue is a collision with the regex for #LowerId defined in kast.md.

Why doesn't (2) work if (1) does?

Regex terminals have a lower priority than non-regex terminals. The logic isn't especially obvious, but when we generate scanners there's a sorting process that happens here:

https://github.com/runtimeverification/k/blob/67ef71b5c525b1f7971101f4b28f299dc3bb7d79/kernel/src/main/java/org/kframework/parser/inner/kernel/Scanner.java#L153-L156

https://github.com/runtimeverification/k/blob/67ef71b5c525b1f7971101f4b28f299dc3bb7d79/kore/src/main/scala/org/kframework/definition/outer.scala#L969-L992

This ordering places regexes at the end of the list, which means that if any non-regex matches first then the overlapping regex won't get considered. It also answers the question of why the #LowerId regex gets tried (and succeeds) before the Foo one. If you reverse this ordering, the code in (1) fails because foo is tokenized as a #LowerId.

Why does the kompiler ask for the format attribute for (3) if for (2) it does not?

Because of the tokenization problem, inner parsing hasn't even finished when the error in (2) is emitted. This means that we don't have the full semantic information to validate the attributes. For (3), inner parsing has succeeded and we can validate the attributes.

runtimeverification / k