Allow multiple actions for tokens in parse tables

leanprover / lean3

Lean Theorem Prover

http://leanprover.github.io/

Apache License 2.0

2.15k stars 217 forks source link

Allow multiple actions for tokens in parse tables #800

Closed dselsam closed 9 years ago

dselsam commented 9 years ago

There may be good reason to not support this feature, but without it there are many idiomatic notations that cannot be used.

import data.matrix
open matrix

variables {A : Type} {m n : nat}

definition row_vector [reducible] (A : Type) (n : nat) := matrix A 1 n
definition get_row [reducible] (M : matrix A m n) (row : fin m) : row_vector A n :=
λ i j, M row j

variables (M : matrix A m n) (row : fin m) (col : fin n)

notation M `[` i `,` j `]` := val M i j
check M[row,col]
notation M `[` i `,` `:` `]` := get_row M i
check M[row,:]
check M[row,col] -- error: invalid expression

The issue is that the action after the , is Expr in the first case and Skip in the second case. Parse tables can only assign a single action to each token:

https://github.com/leanprover/lean/blob/master/src/frontends/lean/parse_table.cpp#L306

and the latter notation declaration overrides the former (without even a warning):

https://github.com/leanprover/lean/blob/master/src/frontends/lean/parse_table.cpp#L394

I haven't thought this through carefully, but at first glance it seems that making m_children a name-multimap, and making the while loop at

https://github.com/leanprover/lean/blob/master/src/frontends/lean/parser.cpp#L1083-L1088

recursive should support the proposed functionality, though perhaps with some performance cost.

Any thoughts?

leodemoura commented 9 years ago

Yes, this is a useful feature that is in the TODO list. We also need it for overloading nary notation. Here is the problematic example.

  import data.list data.examples.vector
  open list vector nat

  variables (A : Type) (a b c : A)

  check [a, b, c]                  -- vector A 3
  check (#list [a, b, c])          -- list A
  check ([a, b, c] : vector A 3)
  check ([a, b, c] : list A)       -- error

We also need this feature to workaround the ugly '{a, b, c} notation. In the current system, we cannot use {a, b, c} because it would conflict with the {x : A | P x} notation.

That being said, we have to define a relaxed notion of "compatible action". Then, the parser would allow us to have multiple compatible actions. We need this notion of "compatibility" to avoid backtracking in the parser.

leodemoura commented 9 years ago

Things are messier than I expected. I'm assuming that a backtracking parser is an unacceptable solution. So, if we support multiple actions, the parser must be able to choose one based on the next token(s).

Note that we can decide whether a token may start an expression or not by simply checking the nud table. So, assume that starts_expr(tk) returns true iff token tk may start an expression.

Here are some cases:

1) Actions Skip and Expr. If starts_expr(next_token), then we use Expr, otherwise we use Skip. This is sufficient for handling M[row, col] and M[row,:] notations. Remark: the approach breaks if the user defines a notation starting with :.

2) Expr and (Exprs, sep, fn). The parser would parse an expression. Then, if the next_token is sep, it uses (Exprs, sep, fn). Otherwise, it would use Expr.
Remark: We are assuming that sep does cannot start an expression. This works for separators such as , and ;. Similarly, we can support (Expr, sep1, f) and (Expr, sep2, g) when the separators are distinct.

3) Binder and Expr, If the next token is not an identifier, then use Expr. If it is an identifier, then consume it. If the next token is :, then use Binder. Otherwise, use Expr. This case is making the assumption that the user did not define notation starting with :. We can enforce that (e.g., sign an error if the user tries to define a notation starting with :). We can also use this approach for handling Binder, Expr and (Exprs, sep, fn).

4) Skip, Binder and Expr. If the next token is not an identifier and does not start an expression, then assume Skip. Otherwise, it is Binder and Expr case.

Another issue is that we cannot disambiguate (Exprs, sep, f) and (Exprs, sep, g) where the only difference is the operation being used to build the nary expression. For example, in my previous message, the nary notation [a, b, c] is overloaded in list and vector, and f and g are list.cons and vector.cons respectively. The only solution I see is to move f and g to the accepting state. So, the action would be just (Exprs, sep). We have to do the same for ScopedExpr

Is this solution satisfactory?

leodemoura commented 9 years ago

Cases 3 and 4 above do not work. The problem is that : is optional. That is, we can write {x | P x} instead of {x : A | P x}.

leodemoura commented 9 years ago

The previous commit addresses the following problem

  import data.list data.examples.vector
  open list vector nat

  variables (A : Type) (a b c : A)

  check [a, b, c]              
  check (#list [a, b, c])    
  check (#vector [a, b, c])    
  check ([a, b, c] : vector A 3)
  check ([a, b, c] : list A)

Now, all of them are accepted.

Tomorrow, I think I can finish the support for items 1 & 2. They are sufficient for supporting M[1,:].

For 3&4, I don't see a good solution. It is a pity because I wish we could write {1, 2, 3} instead of '{1, 2, 3}. We include the ' to avoid the conflict with {a ∈ s | a > 0}, which is a short form for {a : _ ∈ s | a > 0}. We need unbounded look ahead to disambiguate { <id> ∈ <expr> , ...} (a finite set of propositions) and {<id> ∈ <expr> | ... } (a sep).

leodemoura commented 9 years ago

We can now parse the problematic example posted by Daniel. However, the implementation is not as general as I expected. The implementation only supports the following kind of "compatible actions": (Skip, Expr), (Skip, Exprs) and (Skip, ScopedExpr). We can add support for more in the future. Some combinations are problematic. For example, we cannot support (Expr, Exprs) since Exprs (i.e., foldl/foldr) accepts a single expression. Similarly, we cannot support (Exprs, Exprs). To support these cases, we would need to implement a different parsing algorithm and/or support backtracking in the parser.

dselsam commented 9 years ago

I'm assuming that a backtracking parser is an unacceptable solution.

For performance reasons?

I don't think any specific work-around with the current implementation will be the end of the world, but I do fear that they may snowball over time. To the extent that Lean aims to capture the dialect of working mathematicians, it seems fundamental that it support canonical notation.

Any thoughts?

This is somewhat unrelated, but there are other minor restrictions I have noticed, e.g. scoped expressions must appear after a binder, so that map notation { f x | x ∈ X } cannot be used.

leodemoura commented 9 years ago

I'm assuming that a backtracking parser is an unacceptable solution. For performance reasons?

For many reasons:

1- Performance. This is a big issue since we have no control over the kind of notation users will define.

2- The error messages are going to be nastier. It is quite tricky to produce nice error messages for this kind of parser. Some parser generators support this feature (e.g., antlr), but they suggest users should only use it as a last resort. In Lean, the issue is trickier since the user is the one defining the syntax.

3- This is not a minor coding change, and this is not a fun project.

I don't think any specific work-around with the current implementation will be the end of the world, but I do fear that they may snowball over time. To the extent that Lean aims to capture the dialect of working mathematicians, it seems fundamental that it support canonical notation.

I have been thinking about this issue. I doubt the input syntax is a decisive factor for the success of the system. Examples: Latex is one of the most used tools by mathematicians. The cryptic input syntax doesn't seem to be an issue. Computer algebra systems are also heavily used tools, and they all seem to have cryptic/ugly input syntax.

BTW, last month when I was presenting Lean, one person insisted that writing function applications as f a b instead of f(a,b) is a huge mistake, and would hurt Lean's adoption. I really doubt it. We can use the computer algebra systems as a counter-example.

We can always implement pretty printers that produce latex-quality output.

This is somewhat unrelated, but there are other minor restrictions I have noticed, e.g. scoped expressions must appear after a binder, so that map notation { f x | x ∈ X } cannot be used.

Even if we could do it, it would conflict with the sep notation {a ∈ s | a > 0}. So, this feature would only be useful if we move to a backtracking parser.

Another option is to implement "set" like notation using Lua or C++. By set-like notation, I mean:

{a, b, c} finite sets
{a ∈ s | a > 0} separation
{f x | x ∈ X} maps

This would be a function that is invoked whenever the current token is {.

avigad commented 9 years ago

I assume the main concern is that in Lean, a single expression can be quite long, such as an entire proof. Isabelle's parser is extremely flexible -- if I recall correctly, it uses a version of Early's algorithm -- but the proof language is separate from the assertion language, and assertions are enclosed in quotation marks, so that individual expressions are not too large and well delimited. Even so, Isabelle's parser is sometimes noticeably slow.

Does this suggest a possible solution, though? Would it be enough to identify certain "hard delimiters", like { } and [ ], and allow some backtracking between those?

BTW, I started writing this comment before Leo's previous one appeared. Let me emphasize that I don't really mind using a little creativity and ASCII art to work around parser limitations. Better notation would be nice, but it is not a must-have.

leodemoura commented 9 years ago

Isabelle's parser is extremely flexible -- if I recall correctly, it uses a version of Early's algorithm -- but the proof language is separate from the assertion language, and assertions are enclosed in quotation marks, so that individual expressions are not too large and well delimited.

I think Isabelle's current approach requires a lot of support from the editor. The Isabelle theory files look like latex or markdown files. Example: https://github.com/avigad/isabelle/blob/master/Analysis/Central_Limit_Theorem.thy They rely on JEdit to display the theory files in a user-friendly way. Similar tricks are feasible in Emacs. For example, we can tell Emacs to display '{ as {, and have finite sets to be displayed as expected. That is, {a, b, c} instead of '{a, b, c}.

We can also go one step further and implement a structured editor. I'm not interested in implementing this kind of editor, but I will be happy and supportive if someone is.

Does this suggest a possible solution, though? Would it be enough to identify certain "hard delimiters", like { } and [ ], and allow some backtracking between those?

I'm not excited about this solution. First, the implementation effort is the same. Second, I doubt users would be "happy". They would wonder why some notation declarations can be overloaded arbitrarily (e.g., the ones starting at { and [), and other notation declarations cannot.

I'm not interested in investing time in changing the parser into a backtracking one. Again, I will be happy and supportive if someone is interested in doing it and dealing with all issues that will popup.

BTW, I started writing this comment before Leo's previous one appeared. Let me emphasize that I don't really mind using a little creativity and ASCII art to work around parser limitations. Better notation would be nice, but it is not a must-have.

And, we can use Emacs capabilities to workaround/hide the ASCII art.

dselsam commented 9 years ago

Examples: Latex is one of the most used tools by mathematicians. The cryptic input syntax doesn't seem to be an issue.

Another concern is the uncanny valley -- one could argue that cumbersome yet straightforward/unified input syntax is actually better than cryptic ASCII art.

I am not suggesting this, but one could imagine e.g. generic fold notation:

definition finset.build [foldl] := (finset.empty,finset.insert)
check {finset.build a, b, c}
definition list.build [foldr] := (list.nil,list.cons)
check {list.build a, b, c}

and similarly generic ways of handling filtering and mapping, that are easy to remember and that require the user to provide enough information up front so that the parser does not need to backtrack. I think coming up with an adequate taxonomy of notation might be challenging though.