tlaplus / rfcs

RFCs for changes to the TLA+ specification language
MIT License
13 stars 0 forks source link

Proposal to resolve grammar ambiguities #4

Closed ahelwer closed 1 month ago

ahelwer commented 2 years ago

At this point there are three known ambiguities in the TLA+ grammar, where ambiguity is defined as "syntax requiring unreasonable amounts of lookahead to disambiguate". This proposal hopes to resolve these ambiguities in favor of keeping parsing simple, which has the added benefit of not requiring any changes to the current working of the tree-sitter grammar or SANY beyond possibly improved error messages. Philosophically, we might ask whether we want to add complexity to the language specification by defining these as special cases of general rules, or add complexity to TLA+ parsing by requiring these cases be handled according to a straightforward reading of the language spec.

The ambiguities are as follows:

  1. The use of /\ or \/ in nonfix form, as in /\(a, b) (see https://github.com/tlaplus/tlaplus/issues/637 and https://github.com/tlaplus-community/tree-sitter-tlaplus/issues/4)
  2. Infix operators (+), (-), and (/) which conflict with calling an operator with higher-order parameters as in f(+), f(-), and f(/) (see https://github.com/tlaplus/tlaplus/issues/625 and https://github.com/tlaplus-community/tree-sitter-tlaplus/issues/5)
  3. The inclusion of the block comment start token (* as a valid sequence of characters in the language, as in f(*) or (*(a, b)) (see https://github.com/tlaplus/tlaplus/issues/626 and https://github.com/tlaplus-community/tree-sitter-tlaplus/issues/6)

The proposal to disambiguate these is as follows:

  1. Disallow use of /\ and \/ in nonfix form; users can use \land(a, b) or \lor(a, b) in ASCII implementations of TLA+ if they really wish to use these operators in nonfix form.
  2. Treat any contiguous string of characters (+), (-), or (/) as an infix operator symbol; users can use the +, -, and / operators as higher-order parameters by surrounding them with spaces as in f( + ), f( * ), and f( / ).
  3. Treat these character sequences as the beginning of a block comment; users can add spaces to avoid them being parsed as such, as in f( * ) and ( *(a, b)).

The justification for the proposals is as follows:

  1. Since nonfix operators are only useful when referring to operators from other modules, and /\ and \/ are defined in TLA+ builtins so cannot be redefined in other modules, there is no real use for this feature. Allowing their use in nonfix form would considerably complicate the already-complicated logic for parsing conjunction & disjunction lists.
  2. Higher-order parameters are rarely the only parameter to an operator. The cost of disambiguating these tokens at the parser rather than lexical level would be extremely high given this niche application.
  3. Similar justification to (2).

Accepting this proposal will close all issues linked above.

ahelwer commented 1 year ago

Here's an additional ambiguity, involving the -- infix operator:

---- MODULE Test ----
EXTENDS Integers
a -- b == a - b
op == 1 ----(1, 2)
=====================

The -- infix operator is used in infix form, followed without spaces by the -- in nonfix form. Both SANY and the tree-sitter grammar parse this as a hyphen line. I propose this behavior be declared correct and remain unchanged.

ahelwer commented 1 month ago

This has been completed since it's the default way everything work as it currently is, and this RFC itself can serve as documentation of these decisions (although I think there should be a more official way to record RFCs, especially completed ones, than github issues).