Introduce stronger lexical rules for identifiers

This follows from discussion in #1062 and discussion with @bugarela and @thpani in person.

Currently, type variables must be lower case and type constants must be upper case, but otherwise we allow any identifier to start with either lower or upper case letters. In #1062 I discuss some of the benefits we could gain in our sum types syntax from introducing this difference across other constructs.

This issue addresses the more general issue that we allow an (imo) unhealthy level of leeway in all of our identifiers.

Two motivating examples

Cryptic identifiers

This spec is completely valid -- it parses and type checks:

module ___::Arst::t {
    type __a::FPRT::__ = { a::B::__::c : int }
    const X::B::C: __a::FPRT::__
    def X::b::d(__::__::__:__a::FPRT::__ ): __a::FPRT::__ = __::__::__
    def f(x: List[x::B]): x::B = head(x)
}

module ex {
    import ___::Arst::t(X::B::C = { a::B::__::c : 2 }) as A::B
    val x = A::B::X::B::C
}

Escape hatch thru exotic operators

There are a number of weird edge cases we allow that would be ruled on at the very outer edge of our language if we enforced a more restrictive grammar. E.g.,

These are legal records

>>> Rec("{a: 1, b: 'foo'}", 3)
{ {a: 1, b: 'foo'}: 3 }
>>> Rec("                 ", 3)
{                  : 3 }
>>> Rec("()()()()()", 3)
{ ()()()()(): 3 }

These operations, that we should be able to exclude based on syntax alone, make it all the way thru our type system to be evaluated by the REPL:

>>> val x = "a"; {a: 1}.field(x)
static analysis error: error: [QNT000] Record field name must be a string expression but is name: x
Generating record constraints for Rec("a", 1),x
Generating constraints for x

val x = "a"; {a: 1}.field(x)
             ^^^^^^^^^^^^^^^

1
>>> Tup().item(100)
static analysis error: error: [QNT000] Operator expects at least one arguments but was given 0
Checking arity for application of Tup
Generating constraints for Tup()

Tup().item(100)
^^^^^

runtime error: error: Out of bounds, nth(99)
Tup().item(100)
^^^^^^^^^^^^^^^

<undefined value>

Problems with the status quo

Types defined in modules starting with a lower-case letter cannot be referenced with qualifiers.

E.g., given

module foo {
  type T = int
}

module bar {
  import foo
  def f(x: foo::T): foo::T = x
}

The reference to foo::T is read as a type variable, since it starts with a lower case letter (and our grammar for identifiers is arguably too permissive).

Specs can have wildly different styles

This irregularity will make it hard to read specs written with different conventions. The motivating example I've given is an extreme case, but I think it shows how wild things can get!

Records can get wild

As indicated above, record fields can have "qualifications" using :: and the exotic operator Rec allows constructing fields from any arbitrary string. This allows constructing records with cannot be accessed from the dot notation _._.

Benefits of stricter lexical rules

Easy syntactic highlighting
Rule out weird constructs at the edge of the program phase
Ensure specs have a more regular form, improving acculturated readability

Proposed rules

1. Use 2 different classes of identifiers in the lexer:

LOW_ID: unqualified IDs starting with a lowercase letter. E.g., foo
CAP_ID: unqualified IDs starting with an uppercase letter. E.g., Bar

This will enable us to have more expressive and precise grammar rules, which we are in fact already assuming in our syntax type variables.

2. Qualified IDs (of the form `Foo::Bar::id`) are moved to grammar rules

qualLowId : (CAP_ID '::')* LOW_ID
qualCapId : (CAP_ID '::')* CAP_ID

Currently this is instead specified in the lexical rule for IDENTIFIER.

https://github.com/informalsystems/quint/blob/3a2fdb7b3dd57c9d8cdcf496da3a3bc48d8b38c3/quint/src/generated/Quint.g4#L252-L254

These lexical rules make us prone to cases like those discussed here: https://stackoverflow.com/a/16859849 The root cause is that the lexical rules for IDENTIFIER and SIMPLE_IDENTIFIER are ambiguous, in that every instance of the latter is also an instance of the former.

3. Record fields must be `LOW_ID`: `{a: 1, b: "foo"}`

Ensures no qualifiers show up in record field labels.
Sets up an evident lexical difference from sum type constructors (if my suggestion to make those case sensitive is accepted).

4. Data constructs must be `qualCapId`: `List(1)`, `Foo::Set("foo")`

This enables readers to see, purely locally, that an instance of a data type is being constructed, rather than some other operator application.
Sets up an evident lexical difference from record fields, which will helpful for clarity of type definitions and case analysis.
Fits with the rules in Rust (also with most ML's and conventions in many OOP )

5. Exotic operators that lift quint values into the type level are removed from the language:

I.e, Rec, field, with, item, and fieldNames are removed from the language. Instead, the current syntax sugar would be the normal form.

This plugs the holes these create in our types system, and remove a source of common confusion by users and buggy behavior in the REPL. It is related to the topic of stronger lexical rules in so far as these operator currently allow users to totally bypass our existing lexical rules and the related constructs in our type system.

6. ~Modules must be `qualCapId`: `module MyMod { ... }` or `module Foo::MyMod`~

On further thought, thanks to @thpani's question, maybe this isn't needed?

7. Operators (including actions) and values must be `qualLowId`: `def op(...) = ...; val x = ...` or `def Foo::op(...) = ...`

This works with (4) to ensure that data construction and operator application are unambiguous.

This also harmonizes with (3), in the sense that modules are a (quite special) case of product type, and reference to functions/values in a module is a form of projection.

This fits with the rules in Rust (also with most ML's and conventions in many OOP languages).

8. Type variables must be `LOW_ID`: `a` or `foo`.

Prevents identifying foo::x as a type variable.

9. State variables and constants are `qualLowId`

This could be seen as a special case of (7) and has the same motivation.

For the issue around type variables, we can actually fix that by requiring type variables to be SIMPLE_IDENITIFIERS, which don't allow qualifications prefixes. Then any qualification prefixes can be recognized as type constants.

That's some weird examples indeed! 😄 A couple of thoughts:

Naming style

In general, I'm in favor of giving the spec author as much choice in naming as possible. In contrast to implementation languages, a spec should remain reasonably in sync with a possible implementation – naming is an important part here. In particular, I think

Specs can have wildly different styles

is a feature, not a bug :) IMO, style should be consistent within a spec, or a family of specs. This can easily be enforced by a formatter/linter, and need not be baked into the language itself.

What's a minimal set of restrictions that resolves the presented issues?

I'm not super clear why each of the proposed modifications is needed, and how each of them relates to the presented shortcomings.

It would help me if the exposition shows which of the propositions addresses what issue, and argues why it needs to be as strict as proposed.

`Rec` constructor, record fields

These are legal records

>>> Rec("{a: 1, b: 'foo'}", 3)
{ {a: 1, b: 'foo'}: 3 }
>>> Rec("                 ", 3)
{                  : 3 }
>>> Rec("()()()()()", 3)
{ ()()()()(): 3 }

Is this just the Rec constructor being too permissive? It looks like the grammar is less permissive:

https://github.com/informalsystems/quint/blob/1f41cdb3d869f77b1512f1163f4bb7d724ec905e/quint/src/generated/Quint.g4#L187-L189

Also, the qualification issue with record field names can be resolved by requiring them to be SIMPLE_IDENTIFIERs, similar as you suggest for type variables.

Stage of error reporting

These operations, that we should be able to exclude based on syntax alone, make it all the way thru our type system to be evaluated by the REPL:

For the examples given here, both are caught in the type checker, right? If so, imo this is the right stage to catch these – we get much better error messages this way, than if we caught them in the parser.

Case-sensitive naming

I'm not clear why we need to enforce case on module, operator and constant names. Could you explain what that solves, that cannot be achieved by requiring field names and type variables to be SIMPLE_IDENITIFIERs?

make some excellent points.

I've expanded the proposed rules to capture the motivations for each one, including removing some suggestions and retracting the one about case-sensitive module IDs (I think this isn't actually needed, given the other suggestions).

I think this addresses three of the topics in your reply, and I offer replies to the other two here:

Naming style

In contrast to implementation languages, a spec should remain reasonably in sync with a possible implementation – naming is an important part here.

This is a great point. Perhaps my suggestions here are too motivated by an implementation-centric view? That said, I wonder if case-matching names isn't just the wrong level of granularity to be trying to enforce name similarity? So long as the names have the same letters and there is a consistent rule for converting them, isn't that sufficient?

On the other hand, if exact lexical fidelity is needed, my proposal could be adapted by adding a lexical rule for quoted identifiers. E.g.,


'My special operator name'(x: int) = { ... }

This would make operator names maximally flexible while still allowing for a set of rules along the lines I've suggested to eliminate ambiguous and (imo) confusing cases.

There is a precedent for this in prolog's quoted atoms.

Stage of error reporting

For the examples given here, both are caught in the type checker, right? If so, imo this is the right stage to catch these – we get much better error messages this way, than if we caught them in the parser.

The are caught in the type checker, but the fact that we make expressing this kind of pathological cases presents a surface area for possible defects that needn't exist at all. And if it didn't exist, we wouldn't have to worry about the simulator moving forward and evaluating ill-typed terms. The latter matters because it makes the simulator not type safe.

Improving our errors is important, and we should handle them in a way that allows good error messages. Hopefully #17 will help with this. But this is an implementation detail imo. We can always implement this as a check that runs in the listener rather than literally in the antlr grammar rules, but IMO these cases should be addressed by making these kinds of expressions ill-formed syntactically, rather than semantically.

I have the same concerns as Thomas regarding too restrictive name patterns (specially 7 and 9) and quality of reported errors. But I'm very surprised by these weird scenarios we are currently supporting, and we should definitely address this problem!

For name style, I'm not sure what is the conclusion here. Is the proposal to require quoted names for every name not following the proposed rules? Or are we relaxing the rules AND allowing quoted names for even more relaxed rules (i.e. with spaces)?

For error reporting, I believe I understand the point, but I'm still skeptical about having error messages as good as the ones we can provide in the type checker. I wouldn't like for us to end up with expected one of '"', ',', ')', found 'x'" for field with a non-literal as its first argument.

Is the proposal to require quoted names for every name not following the proposed rules? Or are we relaxing the rules AND allowing quoted names for even more relaxed rules (i.e. with spaces)?

I'm advocating for the former as a way to accommodate this concern about having identifiers that are as flexible as possible, but if we really think lexical fidelity down to the level of character case is so critical, maybe that would indicate we should at least do the latter? Just an idea there.

For error reporting, I believe I understand the point, but I'm still skeptical about having error messages as good as the ones we can provide in the type checker.

I think we can bear in mind at least 3 different concerns here:

What is the optimal syntax (in terms of being easy to read and reason about).
At what stage are different kinds of errors caught and reported.
Ensuring good error messages.

These will interact, of course, but imo we should be careful not to handle concerns that belong in one category in another. So, e.g., if we have an error in types, we should handle it during type checking. But if we have error in syntax, we should handle it during parsing. In either case, we should ensure good error messages.

We have multiple routes to improving error messages relating to parsing, and if #17 leaves us short, we can add a step that runs during the ToIrVisitor, or on the produced IR to give us full control. We are already doing some checks at this stage, e.g., the one you showed me to handle type variables.

That is, I don't think leaving a syntax error to the type-checker just to get better error messages is needed (and I'd advocate against that!). Type checking should only be concerned with semantics.

I wouldn't like for us to end up with expected one of '"', ',', ')', found 'x'" for field with a non-literal as its first argument.

If we adopt point (5) this will be a non-issue, as this whole class of errors will be impossible to represent :stuck_out_tongue_winking_eye:

If we adopt point (5) this will be a non-issue, as this whole class of errors will be impossible to represent 😜

Yeah, this seems to be the main source of trouble here 😆

informalsystems / quint