leanprover / lean4

Lean 4 programming language and theorem prover
https://lean-lang.org
Apache License 2.0

Rules for well-formed parsers #638

Open Kha opened 3 years ago

Kha commented 3 years ago

With Parser and ParserDescr/syntax, we have two levels of abstraction for defining parsers. In both cases, however, it is possible to define parsers that break as-of-yet unwritten rules, confusing meta programs such as quotations and the pretty printer. We should define these rules, ideally enforce them at least for syntax by construction, and possibly lint them for Parser.

Abstract rule

The abstract rule is that the structure of the parser output should uniquely determine the parser call graph/grammar derivation tree that produced it. By "structure" (of a Syntax tree) we mean the tree modulo atoms, i.e. exactly what is considered when matching against quotations. An example of a parser breaking this rule is "foo" >> ident <|> "bar" >> term: ignoring atoms, we cannot know which alternative accepted the input bar x. Note that ident <|> term itself would be unambiguous because <|> is left biased. In theory, term <|> ident would also be acceptable, but we would need to know whether ident is part of the term category (or rather, whether the produced kind is so) to decide this case in practice.
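For concreteness, here is a sketch of that shape at the syntax level (the kind name amb is made up, and a current toolchain may handle or reject such alternatives differently):

```lean
-- Sketch only: ignoring atoms, both alternatives produce "one atom followed
-- by one identifier-shaped node", so the tree for `bar x` does not reveal
-- which branch accepted it.
syntax (name := amb) ("foo" ident <|> "bar" term) : term
```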

Another counterexample is many and other repetition combinators. In many p, if p is of unknown "arity" (# of produced nodes), we don't know which syntax node child belongs to which "sequence element". This was "fixed", but that fix is no good either: if we encounter a null node inside of a many p output, we don't know in general whether many introduced it to group an element whose arity is greater than 1 or whether it was produced by p itself. We either have to wrap every sequence element in a node, which would be wasteful, or demand that p is of constant arity 1.
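A sketch of the repetition problem at the Parser level (names made up; this assumes the usual Lean.Parser combinators):

```lean
import Lean
open Lean.Parser

-- Sketch: each `elem` contributes two children (an identifier and a term),
-- so the flattened output of `many elem` does not show where one element
-- ends and the next begins.
def elem  : Parser := ident >> termParser
def elems : Parser := many elem
```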

In practice, we should strengthen this abstract rule: it should be possible to efficiently determine the grammar derivation based on a reasonable amount of static information. For example, we might not want to add new metadata that lets us decide whether ident is in term like we would need to above. And ideally we would like to decide <|> by looking at the root kind of the output alone instead of having to dive further into the syntax tree.

Implementation for syntax

Based on the above rule, here is a proposal for a conservative approximation for syntax, to be implemented in the translation to ParserDescr:

Implementation for Parser & removal of backtracking in the pretty printer

TBD

tydeu commented 3 years ago

@Kha

The abstract rule is that the structure of the parser output should uniquely determine the parser call graph/grammar derivation tree that produced it. By "structure" (of a Syntax tree) we mean the tree modulo atoms, i.e. exactly what is considered when matching against quotations. An example of a parser breaking this rule is "foo" >> ident <|> "bar" >> term: ignoring atoms, we cannot know which alternative accepted the input bar x.

What is conceptually wrong with discrimination based on atoms? That is, wouldn't it be more intuitive and straightforward to resolve this issue by just having quotations match atoms as well? This particular limitation seems very artificial and unnatural to me. It seems very likely that most end-users would write syntax like "foo" >> ident <|> "bar" >> term, expect it to work, and be very confused as to why it doesn't / can't.

Kha commented 3 years ago

What is conceptually wrong with discrimination based on atoms?

Because we want a quotation pattern in a semantic transformation, e.g. a macro, that used fun to still match if the user wrote λ. And avoid the code explosion from checking every single atom as a beneficial side effect.
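For illustration (the notation ignoreBinder is made up), this is the kind of pattern that should keep working regardless of which spelling the user picked:

```lean
syntax "ignoreBinder " term : term   -- made-up notation, for illustration

-- The pattern spells the binder as `fun`, but since atoms are ignored during
-- quotation matching, input written with `λ` matches as well.
macro_rules
  | `(ignoreBinder fun $x => $b) => `(fun $x => $b)

#check ignoreBinder λ y => y + 1   -- still matches despite the λ spelling
```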

That is, wouldn't just be more intuitive and straightforward to resolve this issue by just having quotations match atoms as well?

That still wouldn't solve the efficiency part. We don't want arbitrary "lookahead" to be necessary to differentiate between the two sides of an <|>.

Kha commented 3 years ago

In the end, the high-level syntax command is free to adhere to these rules in whatever way it wants. If an <|> doesn't fulfill the above rules, syntax could wrap the LHS with a unique kind automatically.
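Spelled out by hand, that wrapping could look roughly like this (kind names made up; here both sides get their own kind, so the root kind of the output alone identifies the branch):

```lean
-- Sketch: each alternative now has arity 1 and a distinct node kind.
syntax fooIdent := "foo" ident
syntax barTerm  := "bar" term
syntax (name := disamb) (fooIdent <|> barTerm) : term
```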

tydeu commented 3 years ago

Because we want a quotation pattern in a semantic transformation, e.g. a macro, that used fun to still match if the user wrote λ. And avoid the code explosion from checking every single atom as a beneficial side effect.

Could not tokens with multiple different spellings just be parsed as a node with a unique kind rather than different atoms? That doesn't seem like it would add much overhead. In fact, it could even lessen it, as nodes come with a precomputed hash for their kind while atoms don't. Also, as I'm sure you know, there already exist specialized parsers like unicodeSymbol to parse tokens with different ASCII and Unicode spellings -- these could be retooled to produce nodes rather than atoms.

To your second point, the fact that quotations don't currently verify atoms already causes many headaches when writing macros. While it may be more computationally efficient, ill-formed syntax will often generate panics rather than clean errors when some later part of the code acts in a manner that assumed the syntax was well-formed. As you might imagine, this makes such errors hard to diagnose.

Kha commented 3 years ago

Could not tokens with multiple different spellings just be parsed as a node with a unique kind rather than different atoms?

I can't make sense of this. Syntax is supposed to be a concrete syntax tree, so different inputs ultimately must lead to different atoms. Apart from identifiers, all tokens are stored in atoms, not nodes directly.

ill-formed syntax will often generate panics rather than clean errors when some later part of the code acts in a manner that assumed the syntax was well-formed

I'm not sure checking atoms would have caught the cases where I produced ill-formed syntax any earlier; the structure of the syntax is much easier to get wrong. The real solution here is a strongly-typed syntax type indexed by the syntax kind.

Kha commented 3 years ago

If the syntax quotation elaborator knew the parser that generated each syntax node of the pattern, it could implement special matching behavior for specific parsers such as unicodeSymbol. That would be a major refactoring, so I haven't considered it so far, though it might eventually be necessary for the typed syntax support (if we want syntax category types such as Syntax `term). Alternatively, we could wrap the output of unicodeSymbol in a special syntax kind that tells the quotation elaborator to ignore it during matching; perhaps that's even what you meant. It would introduce overhead in some very common syntax, but maybe not significantly so.
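A rough sketch of that second option at the Parser level (the kind name is made up, and registration/pretty-printing details are ignored):

```lean
import Lean
open Lean.Parser

-- Hypothetical: wrap the unicodeSymbol output in its own node kind, which
-- the quotation elaborator could then treat as irrelevant during matching.
def funTok : Parser := node `irrelevantFunTok (unicodeSymbol "λ" "fun")
```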

tydeu commented 3 years ago

Could not tokens with multiple different spellings just be parsed as a node with a unique kind rather than different atoms?

I can't make sense of this. Syntax is supposed to be a concrete syntax tree, so different inputs ultimately must lead to different atoms. Apart from identifiers, all tokens are stored in atoms, not nodes directly.

Alternatively, we could wrap the output of unicodeSymbol in a special syntax kind that tells the quotation elaborator to ignore it during matching; perhaps that's even what you meant. It would introduce overhead in some very common syntax, but maybe not significantly so.

Yeah, that is closer to what I meant by that sentence: keywords with multiple different spellings (atoms) could be wrapped in a node with a distinct kind. I don't think this would incur significant overhead, but, then again, I am not an expert on the parser.

tydeu commented 3 years ago

As an aside:

The real solution here is a strongly-typed syntax type indexed by the syntax kind.

I think this would be a wonderful long-term goal and wasn't aware that it was already being considered! I can already see many places where it would make macro writing considerably more straightforward. For example, one wouldn't have to keep re-verifying that a node is of an expected kind at each level of a macro.
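For reference, current Lean 4 does ship such a type, TSyntax, indexed by syntax node kinds; a small sketch of the ergonomic win (the function name is made up):

```lean
import Lean
open Lean

-- Sketch: `x` is statically known to be identifier syntax and `e` a term, so
-- no `isIdent`/`isOfKind`-style re-checks are needed before splicing.
def addTo (x : TSyntax `ident) (e : TSyntax `term) : MacroM (TSyntax `term) :=
  `($x + $e)
```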

gebner commented 2 years ago

Because we want a quotation pattern in a semantic transformation, e.g. a macro, that used fun to still match if the user wrote λ.

Note that we only have 7 Unicode/ASCII symbol pairs; if those were the only issue, I think it would be defensible to just remove the duplicate notation. I agree it would be a good idea to add an irrelevant node kind that instructs the pattern matcher to ignore that subtree.

A related gotcha is the pattern `(let $x := $v; $b); it would be nice if that worked as one might expect at first glance.
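Presumably that is because the pattern pins down one exact shape of the let (a plain letIdDecl with no extra binders, no type ascription, and an explicit semicolon), so superficially similar lets fall through. A sketch, with a made-up notation only to make the match observable:

```lean
syntax "unlet " term : term   -- made-up notation, for illustration

macro_rules
  | `(unlet let $x := $v; $b) => `(let $x := $v; $b)

#eval unlet let n := 1; n + 1   -- 2: this exact shape matches
-- `unlet let n : Nat := 1; n + 1` would not match this rule: the type
-- ascription (likewise `let` patterns, extra binders, or a newline instead
-- of `;`) changes the subtree shape.
```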