leanprover / lean3

Lean Theorem Prover
http://leanprover.github.io/

Parser refactoring + Hygienic macro system #1674

Open leodemoura opened 7 years ago

leodemoura commented 7 years ago

@kha @gebner: we have been talking about the parser refactoring for a few months. @david-christiansen has many ideas and suggestions. I'm opening this RFC to record his ideas and collect suggestions.

David is suggesting an architecture where we would have the following main steps:

  1. read : string -> syntax
  2. expand : syntax -> pre-expr
  3. elaborate : pre-expr -> expr

The read procedure would be much simpler than the current Lean parser, and it would generate an s-expression-like data structure (syntax). It would include precise position information and comments. Here are some links that David sent me: https://www.cs.indiana.edu/~dyb/pubs/LaSC-5-4-pp295-326.pdf, https://www.cs.utah.edu/plt/publications/popl16-f.pdf

We would be able to reconstruct the input string faithfully from a syntax object produced by the read procedure. The syntax object would also faithfully represent constructs such as match-exprs, do-blocks, calc-blocks, etc., which are currently expanded at parsing time. Names would not be resolved by the read procedure.

It is not clear to us whether we should combine the expand and elaborate steps or not. It seems it would be a good idea to keep them separate, at least initially; in this way, the elaborator code would not be affected. The downside of not combining them is that expansion cannot take type information into account.
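
Roughly, as (hypothetical) Lean signatures, where syntax and pre_expr are placeholder types and error handling is omitted:

meta constant syntax   : Type   -- placeholder for the read output
meta constant pre_expr : Type   -- placeholder for the expansion output

meta constant read      : string → syntax     -- keeps precise positions and comments, no name resolution
meta constant expand    : syntax → pre_expr   -- user-extensible, written in Lean
meta constant elaborate : pre_expr → expr     -- the existing elaborator, unchanged initially

meta def process (input : string) : expr :=
elaborate (expand (read input))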

Users would be able to define new syntax using Lean. David also suggested that we could even have different read procedures like Racket. Here is a paper describing how this works in Racket: http://users.eecs.northwestern.edu/~stamourv/papers/languages-the-racket-way.pdf

An important aspect of this proposal is that the expansion step is implemented in Lean, which means expanding user notation becomes far more flexible, and we can define new syntax based on old syntax. In the current implementation, new syntax must be mapped into pre-terms. So, it is non-trivial to add new syntax that extends existing syntax.

This approach is also useful for implementing tools such as formatters and linters using Lean. I'm not sure whether now is the best time for implementing this change, but I believe it would simplify the current implementation and make it much more flexible.

gebner commented 7 years ago

I think this is a good approach. ASTs that are close to the source code are pretty popular these days. Scala's meta-programming framework also recently switched to it (scala.meta).

read : string -> syntax

What strings does read read (i.e., whole files or only parts)? How do we deal with infix operators or user-defined syntax?

infix ` # ` := list.diff
#eval [1,2,3] # [2,3,4]

/- declare coinductive command (see Sebastian's branch) -/
coinductive foo : Prop
| intro : foo

We could reuse the current approach: read reads more-or-less a command, then we expand, elaborate and execute it to get a new environment and repeat. Reconstructing the input file would work by concatenating the pieces.
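
As a sketch (all names hypothetical), that loop would thread the environment so later commands can be read with the notations declared by earlier ones:

meta constant syntax : Type   -- placeholder for the read output
meta constant read_command : environment → string → option (syntax × string)   -- next command + remaining input
meta constant elab_command : environment → syntax → environment                -- expand, elaborate, execute

meta def process_file : environment → string → environment
| env input :=
  match read_command env input with
  | some (cmd, rest) := process_file (elab_command env cmd) rest
  | none             := env
  end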

I'm just wondering if we could parse a file in one go, and then read a section ... end section as (section ...).

@david-christiansen Is there anything better in the literature? Skimming through the papers that Leo posted, nobody even seems to mention infix operators.

expand : syntax -> pre-expr

I think we also need an expand_command : syntax -> tactic environment.

It is not clear to us whether we should combine the expand and elaborate steps or not.

I agree that, initially, it may be easier to just split the parser into read/expand without modifying the elaborator. But is there a fundamental reason why we'd want to keep them separate? Even if we want to expand macros in a separate pass, I think it would be easier to have a syntax -> syntax expansion.
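
In other words (hypothetical signatures):

meta constant syntax : Type   -- placeholder for the read output

meta constant expand_command : syntax → tactic environment   -- commands produce a new environment
meta constant expand_step    : syntax → tactic syntax        -- a single syntax -> syntax expansion step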

The syntax object will also represent faithfully constructs such as match-exprs, do-blocks, calc-blocks, etc which are currently expanded at parsing time. Names would not be resolved by the read procedure.

So if I see this correctly, we cannot substitute syntax objects (since we do not want to hardcode the binding structure of match, and also whether a name resolves to a binder, etc.).

Is this data structure what you had in mind?

structure position :=
(line : ℕ) (column : ℕ)

structure range :=
(left : position) (right : position)

structure location :=
(file : string) (range : range)

inductive origin
| file (loc : location)
| synthetic

meta inductive syntax
| num (n : ℕ)
| name (n : name)
| string (s : string)
| quote (e : expr)
| node (o : origin) (kind : name) (children : list syntax)

/-
match x with
| list.cons x xs := x
| _ := 42
end
-/
#eval
let o (l1 c1 l2 c2 : ℕ) := origin.file ⟨"example1.lean", ⟨l1,c1⟩, ⟨l2,c2⟩⟩ in
syntax.node (o 1 0 4 3) `match [
    syntax.node (o 1 6 1 7) `ident [syntax.name `x],
    syntax.node (o 2 2 2 16) `app [
        syntax.node (o 2 2 2 11) `ident [syntax.name `list.cons],
        syntax.node (o 2 12 2 13) `ident [syntax.name `x],
        syntax.node (o 2 14 2 16) `ident [syntax.name `xs]
    ],
    syntax.node (o 2 20 2 21) `ident [syntax.name `x],
    syntax.node (o 3 2 3 3) `placeholder [],
    syntax.node (o 3 7 3 9) `num [syntax.num 42]
]

/-
λ x, x + 2
-/
#eval
let o (l1 c1 l2 c2 : ℕ) := origin.file ⟨"example2.lean", ⟨l1,c1⟩, ⟨l2,c2⟩⟩ in
syntax.node (o 1 0 1 10) `lambda [
    syntax.name `x,
    syntax.name `default,
    syntax.node origin.synthetic `placeholder [],
    syntax.node (o 1 5 1 10) `app [
        syntax.node (o 1 7 1 8) `ident [syntax.name `+],
        syntax.node (o 1 5 1 6) `ident [syntax.name `x],
        syntax.node (o 1 9 1 10) `num [syntax.num 2]
    ]
]

This approach is also useful for implementing tools such as formatters and linters using Lean.

For linters I think it would be good to have a mapping from syntax objects to the elaborated expressions (so that you can infer types, for example). With such a mapping, we could also do auto-completion and other editor functions in a nice way.

gebner commented 7 years ago

The syntax object will also represent faithfully constructs such as match-exprs, do-blocks, calc-blocks, etc which are currently expanded at parsing time.

Notation should probably also remain unexpanded then.

david-christiansen commented 7 years ago

Is there anything better in the literature? Skimming through the papers that Leo posted, nobody even seems to mention infix operators.

Most of the work on hygienic macros has been in the context of languages without infix operators. Dylan has infix operators, though: https://people.csail.mit.edu/jrb/Projects/dexprs.pdf . Fundamentally, though, infix operators are a feature of read, and read is responsible for handling them.

So if I see this correctly, we cannot substitute syntax objects (since we do not want to hardcode the binding structure of match, and also whether a name resolves to a binder, etc.).

Substitution in syntax objects should be possible, but with variables whose binding is a phase earlier. It makes sense to replace metavariables, but not variables, in a syntax object. The macro hygiene system takes care of making sure that the variables in the final expression point to what they would have pointed to in the original syntax.

I think those syntax examples are very close to what we were talking about. It probably makes sense to give them a bit more structure than just a list of subterms, though. For example, the LHS and RHS of a match clause should probably be grouped together. And because Lean has a fairly rich notation system, a single source span is likely not to be enough. Instead, there should be a source span for the expression as well as a list of tokens that are the keywords and delimiters involved in the expression, with source locations for each, as well as comments and the like.

Notation should probably also remain unexpanded then.

I would think that notations would introduce a new operator for forming syntax objects, a new action for expanding them, and an action for pretty printing the notation.

For linters I think it would be good to have a mapping from syntax objects to the elaborated expressions (so that you can infer types, for example). With such a mapping, we could also do auto-completion and other editor functions in a nice way.

The way it works in Racket is pretty nice. Syntax objects have arbitrary key-value maps associated with them, and one of the properties in that map is the original, user-written syntax that eventually gave rise to the current object. In Lean, elab could be syntax -> term, and then the term could say "I'm from there!".
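
A sketch of what that could look like on the Lean side (field names are made up):

meta constant syntax : Type   -- placeholder for the read output

-- hypothetical: arbitrary properties attached to a syntax node, including a
-- pointer back to the user-written syntax that gave rise to it
meta structure syntax_props :=
(original : option syntax)        -- "I'm from there!"
(props    : rb_map name string)   -- other key-value properties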

david-christiansen commented 7 years ago

I've become pretty convinced that the expansion process makes the most sense as going directly to terms. Imagine expand : syntax -> elab term, where elab is a monad with effects suitable for expansion-time. I see that as a subset of the tactic effects, including things like introducing a new metavariable, associating syntax objects with that metavariable to be expanded later, doing explicit calls to unification, etc. This allows syntax expansion to be type-directed, and keeping only deterministic operations in elab means that expansion can be done incrementally.
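
As a rough sketch (all names hypothetical), the expansion-time interface could look like:

meta constant syntax : Type          -- placeholder for the read output
meta constant elab   : Type → Type   -- the expansion-time monad

-- a deterministic subset of the tactic effects:
meta constant elab.mk_mvar      : expr → elab expr            -- new metavariable of the given type
meta constant elab.expand_later : expr → syntax → elab unit   -- expand this syntax into that metavariable later
meta constant elab.unify        : expr → expr → elab bool

meta constant expand : syntax → elab expr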

gebner commented 7 years ago

Fundamentally, though, infix operators are a feature of read, and read is responsible for handling them.

Maybe infix operators were a bit of a red herring. What I was wondering is whether there are any good solutions to user-defined syntax:

infix `#` := ... -- declare on line 1
#check 1 # 2 -- and on line 2 read already needs to parse a different grammar

And because Lean has a fairly rich notation system, a single source span is likely not to be enough.

Ah, originally I thought it would be enough since you can always recover the input tokens by rerunning the scanner. At least that works well enough for scala.meta (on second glance they just discard comments in transformations...).

But that makes modifying (and then pretty-printing) syntax objects awkward. Ideally you could transform syntax trees by just traversing the tree. Say you want to rename a constant c to d, then ideally you'd just map over the syntax tree, and replace each occurrence of (ident c) with (ident d), and pretty-print the result. But for that to work we need to be able to print syntax trees including comments, etc., without reference to the original file.

Do you have an idea on how to store that nicely? Maybe an extra case in syntax?

def tokens := string -- we can expand this

meta inductive syntax
| num (n : ℕ)
| name (n : name)
| string (s : string)
| quote (e : expr)
| node (o : origin) (kind : name) (children : list syntax)
-- new:
| source (tks : tokens)

/-
λ x, x + 2
-/
#eval
let o (l1 c1 l2 c2 : ℕ) := origin.file ⟨"example2.lean", ⟨l1,c1⟩, ⟨l2,c2⟩⟩ in
syntax.node (o 1 0 1 10) `lambda [
    syntax.source "λ ",
    syntax.node (o 1 2 1 3) `ident [
        syntax.source "x ",
        syntax.name `x
    ],
    syntax.name `default,
    syntax.source "",
    syntax.node origin.synthetic `placeholder [],
    syntax.source ", ",
    syntax.node (o 1 5 1 10) `infix [
        syntax.node (o 1 5 1 6) `ident [
            syntax.source "x",
            syntax.name `x
        ],
        syntax.node (o 1 7 1 8) `ident [
            syntax.source " + ",
            syntax.name `+
        ],
        syntax.node (o 1 9 1 10) `num [
            syntax.source "2",
            syntax.num 2
        ]
    ]
]

The interleaving has the advantage that we can always print back syntax trees, even if they have unknown nodes. On the other hand, building them by hand is ugly (since you need to provide the tokens in addition to the trees). So I think this might be the wrong approach.

gebner commented 7 years ago

Syntax objects have arbitrary key-value maps associated with them, and one of the properties in that is the original, user-written syntax that eventually gave rise to the current object.

In Lean, (elaborated) expressions have a tag field (an unsigned integer). I guess we could keep a map/array of syntax trees after elaboration, and have the tags index into this array.
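
Something along these lines (names hypothetical):

meta constant syntax : Type   -- placeholder for the read output

-- hypothetical: after elaboration, keep a table from expr tags to the syntax
-- tree each subterm was elaborated from
meta structure elab_result :=
(val        : expr)
(syntax_for : rb_map ℕ syntax)   -- indexed by the tag stored on each expr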

leodemoura commented 7 years ago

In Lean, (elaborated) expressions have a tag field (an unsigned integer). I guess we could keep a map/array of syntax trees after elaboration, and have the tags index into this array.

@gebner I think we should replace m_tag with a proper annotation object. (EDIT: an extensible annotation object). The tag was added in Lean 0.1, when we had completely different ambitions for Lean. At that time, the goal was: a prover backend for dependent type theory. Now, we view Lean as a programming language and a proof assistant. The overhead of having an extra pointer is acceptable. Moreover, it would simplify our design. For example, we would be able to remove the pos_info_provider.

leodemoura commented 7 years ago

@david-christiansen Is there anything better in the literature? Skimming through the papers that Leo posted, nobody even seems to mention infix operators.

@gebner David sent me the following additional reference: http://www.ccs.neu.edu/home/dherman/research/papers/dissertation.pdf

leodemoura commented 7 years ago

But that makes modifying (and then pretty-printing) syntax objects awkward. Ideally you could transform syntax trees by just traversing the tree. Say you want to rename a constant c to d, then ideally you'd just map over the syntax tree, and replace each occurence of (ident c) with (ident d), and pretty-print the result. But for that to work we need to be able to print syntax trees including comments, etc., without reference to the original file.

@gebner I think it would be nice to be able to transform syntax objects. That is, implement syntax to syntax transformations. If we want to do this, (I think) we would need explicit binding sites in the syntax object. For example, the notation have h : A, from P, R would be represented by the syntax object

syntax.node `have [(syntax.binder `h [A, P] [R])]

where syntax.binder is a primitive used to encode binding sites. We would also have a de-Bruijn variable construct syntax.var idx. In the node R, h would be represented by syntax.var idx. Remark: I'm suppressing origin, source, etc. Similarly, lambda (x : A), B would be encoded as

syntax.node `lambda [(syntax.binder `x [A] [B])]
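
In terms of the syntax type sketched earlier, design (1) would roughly add constructors like these (names illustrative, origin/source suppressed as above):

meta inductive syntax
| num    (n : ℕ)
| name   (n : name)
| string (s : string)
| var    (idx : ℕ)                                              -- de-Bruijn variable introduced by a binder
| binder (n : name) (args : list syntax) (body : list syntax)   -- explicit binding site
| node   (kind : name) (children : list syntax)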

I discussed this approach with @david-christiansen, and he pointed out that it complicates the reader. We would need to have something like our m_local_decls. Moreover, for some syntactic constructs (e.g., match-expressions) the binding sites are not explicit. Things can get very complicated when we allow an extensible notion of pattern. Let's call this design "explicit binding sites at syntax" (1). I have a bias towards it because I know how to implement it efficiently, and we would be able to have syntax to syntax transformations.

Another possible design (2) is "no binding sites at syntax". In this design, bindings are resolved at expansion time, that is, at expand : syntax -> elab expr. I have a pretty good idea how to implement this one. However, it is not totally clear how to implement sophisticated syntax to syntax transformations without producing expr as an intermediate step. David suggested the following idiom for transforming s : syntax. First, we expand and obtain e : expr := expand s. The subterms of e would contain references to s nodes. Then, we would traverse e to figure out the "meaning" of s nodes, and then transform s using this information. I'm not fully convinced this can be done. My main concern is that some of the information may be lost in the translation to e. What do you think?

The main advantage of this design is that read(ers) are much simpler to implement. Binding resolution is implemented once for each kind of syntax node. That is, we don't need to re-implement it for each new reader. Another advantage is that we have more information (e.g., elaborated types) when resolving binding sites.

BTW, I read the paper "Binding as Sets of Scopes" (linked above), and I still don't understand why this approach is better than the locally nameless approach. In the “Related Work” section, the author says “While our work shares certain goals with techniques for representing resolved bindings, such as de Bruijn indices, higher-order abstract syntax (Pfenning and Elliott 1988), and nominal sets (Pitts 2013), those techniques seem to be missing a dimension that is needed to incrementally resolve bindings introduced and manipulated by macros.”. I can see the advantages of design (2) (i.e., "no binding sites at syntax"), but I don't see yet the advantages of "Binding as Sets of Scopes". AFAICT, we can implement design 2 by having a local context that maps (user facing) names to local constants in the elab monad. It would be similar to the m_local_decls field we currently use at parsing time.
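
Concretely (a sketch, names made up), design (2) would thread something like this through the elab monad:

meta constant syntax : Type   -- unresolved read output

-- bindings are resolved during expansion using a local context in the
-- elab state, similar to m_local_decls at parsing time
meta structure elab_state :=
(env  : environment)
(lctx : rb_map name expr)   -- user-facing name ↦ local constant

meta def elab (α : Type) : Type :=
elab_state → option (α × elab_state)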

leodemoura commented 7 years ago

@kha How do we feel about this change?

leodemoura commented 7 years ago

I think we also need an expand_command : syntax -> tactic environment.

@gebner I missed this one. I agree with you. However, I think we will also need expand : syntax -> elab expr where elab is a different monad that encapsulates the state of our elaborator.

Maybe infix operators were a bit of red herring. What I was wondering is whether there are any good solutions to user-defined syntax:

I think we should separate elaboration from parsing (i.e., the read procedure). In a distant future, we may have multiple readers like Racket, and each of them may provide a mechanism for user-defined syntax. We can start with a reader based on the Pratt parser we use in the current implementation. Note that most of the complication there is because we didn't want to backtrack in the parser.

leodemoura commented 7 years ago

I'm considering the following approach:

  1. The reader produces a syntax object similar to the one described by @gebner above. The binding sites are not resolved, so it is easy to implement different readers.
  2. Given an environment, a local_context and a syntax object, we resolve binding sites. The result is a syntax object with resolved binding sites. After this step, we would have nodes such as

syntax.node `have [(syntax.binder `h [A, P] [R])]

I'm omitting occurrences, source, etc. The main point here is that binding sites and names have been resolved.

  3. The elaborator converts syntax with resolved binding sites into expr.

Note that we can expand macros and apply other transformations to syntax with resolved binding sites. We handle this kind of object using the locally nameless approach.

@david-christiansen Does this approach make sense?

gebner commented 7 years ago

syntax.node `have [(syntax.binder `h [A, P] [R])]

Why does syntax.binder take A and P as arguments? (I'm concerned that we end up having two different syntax representations for before and after name-resolution.) We could put them into the syntax.node part instead (just like before name-resolution):

syntax.node `have [A,P, syntax.binder `h R]

The explicit binder approach should also work in more interesting cases, such as match:

syntax.node `match [
  A,
  syntax.binder [`x, `y, `z] $ -- or multiple binders?
    syntax.node `equation [
      /- x::y::z::_ -/,
      /- x+3 -/
    ]
]

Just to summarize my thoughts on the binding strategies:

gebner commented 7 years ago

David suggested the following idiom for transforming s : syntax. First, we expand and obtain e : expr := expand s. e subterms would contain references to s nodes. Then, we would traverse e to figure out the "meaning" of s nodes, and then transform s using this information. I'm not fully convinced this can be done. My main concern is that some of the information may be lost in the translation to e. What do you think?

This issue is independent of the binding representation that we choose. Just to clarify, I think this idiom is intended to (partly) address queries such as the following:

  1. Given a syntax (sub-)tree, what is its type after elaboration?
  2. Given a syntax (sub-)tree, what is its expression after elaboration?
  3. Given a syntax (sub-)tree, what is the expected type during elaboration?
  4. Given a syntax (sub-)tree, what is the local context during elaboration?
  5. Given a syntax (sub-)tree t and a modified version t' what is the elaborated type/expression of t'?
  6. Given an expression, what was its syntax tree?

Note that these queries are about sub-trees, so you can't just run the elaborator/expand on them. (I'd like answers to all of these queries. For autocompletion, 1 and 3 are important. I don't think these queries would be used much in syntax extensions. I think these queries would be handy for linters and refactoring.)

In David's approach, we could store the syntax tree in the annotation of the expression. Then we could go from expressions to syntax trees. This would allow us to answer queries 1, 2 and 6.

leodemoura commented 7 years ago

We could put them into the syntax.node part instead (just like before name-resolution):

syntax.node `have [A,P, syntax.binder `h R]

Great.

Extra name-resolution step: Seems like a reasonable compromise. Syntax extensions need to define the name/binding resolution operation.

Great. I prefer this option too. The "No binding information" option is the worst one from my point of view. Note that we would still have to perform binding site resolution to be able to implement substitution and free variable operations.

In David's approach, we could store the syntax tree in the annotation of the expression. Then we could go from expressions to syntax trees. This would allow us to answer queries 1, 2 and 6.

I think this is great. It simplifies error reporting, and we may also use the information to improve the pretty printer. For example, last week, @jroesch pointed out that the pretty printer has no support for match-expressions, and this may confuse users when visualizing goals.

Kha commented 7 years ago

(I'm still reading through the papers, there was a four-day bridge weekend here in Germany)

david-christiansen commented 7 years ago

I think it would be helpful to consider, from a high level, a few non-trivial procedural metaprograms (what we call "macros" over in Lisp-land) when thinking about how to do binders in syntax.

Imagine I wanted to implement my own pattern matcher and my own embedding of an OO language. A pattern matcher needs to discover which of the identifiers in patterns are references to things like constructors or directives that change the matching, and which are variables to be bound in the right hand side of the matching clause. An OO language has an identifier like self that is bound inside of instance methods, and may also bring the instance's fields into scope in a method.

In each case, scope resolution is pretty non-trivial. The syntax is going to be processed by a fairly complicated metaprogram, and the binding structure is determined by the binding structure of the resulting core-language program, because the core-language program is the meaning of the expression. Having a name-resolution phase prior to expansion means that a large part of each of these nontrivial metaprograms needs to be written twice, once to resolve bindings, and once to emit the realization of that binding structure in the next step.

If the reader is what resolves names in syntax, then the reader needs to be extended to support self in an embedded OO language, and to know which names are constructors. If there is a pass prior to expansion, then a lot of work needs to be duplicated by metaprograms with interesting binding structure.

I think that @gebner missed a strategy, which is essentially the Scheme/Racket strategy: expanding syntax proceeds out->in, and emits bindings as it goes. The expansion monad tracks the context in which a piece of syntax is being expanded, and uses that to resolve identifiers that are emitted by expansion. So in order to implement a binder, a syntax extension expands to a binder. This becomes a de Bruijn index when the identifier is finally resolved.

How about this for an API sketch?

  1. The reader (parser) maps input strings to syntax objects.
  2. Syntax objects are essentially s-expression-like, with added metadata like source locations and the scopes in which they participate.
  3. There is an expander monad, which has most of the pure operators from tactic. Keeping it pure lets it do things like look up definitions, but it preserves incrementality in editors by allowing unchanged expansion results to be cached.
  4. expander also has a task queue, and an operation: enqueue : syntax -> expander metavar that creates a new metavariable and schedules it to be solved by expanding the syntax object in it.
  5. Metaprograms have the type expander expr, where expr is the kernel language.

Under this kind of API, syntax can straightforwardly expand into new syntax by enqueueing it and emitting a reference to the resulting metavariable. Additionally, the binding structure can be discovered incrementally as the program is expanded, at which point the surrounding context is available for resolving names, and expansion can probably be parallelized to some extent.
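
For instance (names hypothetical, with the metavariable returned directly as an expr), such a metaprogram would look roughly like this:

meta constant syntax   : Type
meta constant expander : Type → Type
-- enqueue: create a metavariable and schedule the given syntax to be
-- expanded into it; the returned expr is a reference to that metavariable
meta constant expander.enqueue : syntax → expander expr

meta constant rewrite_my_notation : syntax → syntax   -- build the replacement syntax

meta def expand_my_notation (stx : syntax) : expander expr :=
expander.enqueue (rewrite_my_notation stx)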

What this doesn't provide is substitution and free-variable analysis without expanding syntax. Much of the utility of substitution can be recovered by emitting a let-binding, or by having a meta-level let-binding that doesn't leave a trace in the resulting program. Otherwise, the answer is to first annotate each identifier in the original syntax object with a unique ID, then expand it, and recover the binding structure from the kernel language expression that results.

leodemoura commented 7 years ago

@david-christiansen Thanks for the feedback. Your design makes sense. My concern is that it doesn't support operations such as expand_macros : syntax -> syntax. That is, we would only be able to expand syntax into expr. One of the papers above says this is a useful operation (e.g., for debugging macros). Is this an issue?

It is also not clear to me whether we will be able to implement robust refactoring tools (i.e., transformations from syntax to syntax). It would be great to have one for renaming declarations. I agree with you that we can inspect the generated expr to find out how names were resolved. My concern here is that information may be lost when translating/expanding into expr. For example, consider the scenario where the generated expr does not contain any sub-expression linking to an identifier in the input syntax. I think this may happen when elaborating complicated declarations. Should we treat this kind of situation as a bug?

Here is an orthogonal issue: one of the papers above stresses the impact of macro expansion on debugging. We will have a similar issue: the impact of macro expansion on proofs. In the paper, they say it is great to preserve as much of the original structure as possible. It is not clear how to do it when proving properties of definitions that make heavy use of macros. Before reading the paper, I was assuming a very naive approach: just expand the macros :) Should we be concerned about this?

david-christiansen commented 7 years ago

Racket's local-expand is certainly highly useful. Not only is it useful for inspecting the output of expansion, it is also useful for inverting control to implement other control structures in metaprograms. There is not a separate expr datatype in the Scheme family, but typically a subset of syntax objects are taken to represent the core forms, and this corresponds to expr in the above.

It is also very useful to be able to single-step expansion, which I think is supported pretty well by giving the current expression and the expansion queue.

If the result of expansion does not have any references to a binding, then by definition there is no reference to worry about, so renaming becomes trivial. What breaks this is when hygiene is bent to introduce new identifiers. For example, defining a datatype introduces new names based on the original one. In Racket, syntax properties are used to communicate this fact to tools. Presumably, an additional constructor would make sense for that as well.

I think that proving properties of definitions that use macros should exactly have the "just expand the macros" semantics. The meaning of any expression can only be the meaning of its elaboration in the kernel language. What is your concern here?

gebner commented 7 years ago

the binding structure is determined by the binding structure of the resulting core-language program, because the core-language program is the meaning of the expression.

I think this is a non-sequitur. We can define a binding structure on syntax trees, and in a fundamental sense I think the user sees it this way as well.

When you have an expression like match x with (y::ys) := y+1 | _ := 42 end, then clearly the y in y+1 is bound. But I don't think anybody would justify this by saying that the match expands to list.cases_on x ... (λ y ys, y+1). I think it is much more natural to argue that it is bound because the name occurs on the left hand side of the match case, i.e. by arguing on the level of the syntax tree instead of the elaborated expression.

Having a name-resolution phase prior to expansion means that a large part of each of these nontrivial metaprograms needs to be written twice, once to resolve bindings, and once to emit the realization of that binding structure in the next step.

This is a good point.

Syntax objects are essentially s-expression-like, with added metadata like source locations and the scopes in which they participate.

The participating scopes is basically a list name, right? ~And everything in a definition is in the same scope?~

-- begin of scope
def foo (x : list nat) : nat :=
match x with (y::ys) := y+1 | z := 42 end
-- end of scope, and there are no other scopes

EDIT: just read the POPL paper, and they give (almost) every subexpression its own scope. Essentially this is a kind of global view of the local context. The whole expression has a map of scopes to names introduced in this scope, and you get the local context at each point by combining the local context of all scopes that are in scope. This seems like a long list of scopes for each expression.

There is an expander monad, which has most of the pure operators from tactic. Keeping it pure lets it do things like look up definitions, but it preserves incrementality in editors by allowing unchanged expansion results to be cached.

I don't think restricting tactic buys us anything. Caching expansion results is probably less useful than it seems--we can only reuse results if the environment hasn't changed, and most of the time it changes. (There are further technical issues with caching, see #1601.)

expander also has a task queue, and an operation: enqueue : syntax -> expander metavar that creates a new metavariable and schedules it to be solved by expanding the syntax object in it.

There should probably be an extra argument for the expected type:

enqueue (expected_type : expr) (syn : syntax) : expander expr

I guess the local context is handled as state of the expander monad, and saved when calling enqueue.

Much of the utility of substitution can be recovered by emitting a let-binding, or by having a meta-level let-binding that doesn't leave a trace in the resulting program.

Explicit substitutions are a nice approach for macro expansion. But they don't really help if you want to get a syntax tree back to insert it in the editor (ideally refactoring would be able to remove let-bindings).

gebner commented 7 years ago

I think I have an idea on how to deal with both queries 1-6 and the problem of renaming bound variables.

We can add a "tracing" mode to the expander/elaborator, where we store all calls to enqueue and add operation to inspect them (or even by default, depends on performance):

get_local_context : syntax -> expander expander_local_context
get_expected_type : syntax -> expander expr
get_metavar : syntax -> expander expr
-- maybe, not sure how easy/reliable this is
find_source : expr -> expander syntax

Then we can implement variable renaming (fairly) reliably by:

  1. expand the source syntax
  2. Find the identifier node that is at the position in the editor. Obtain its corresponding local constant by get_metavar >>= instantiate_mvars.
  3. Traverse the source syntax, at every identifier node call get_metavar >>= instantiate_mvars and see if it is the local constant that we want to replace. If so, rename the identifier node.

(In a sense, we annotate the source syntax with fresh variable names that satisfy the regularity condition: every binder has a different bound variable name.)
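
As a sketch (helper names made up; get_metavar is the instrumented operation from above), step 3 could be:

meta constant syntax   : Type
meta constant expander : Type → Type
meta constant expander.monad : monad expander
attribute [instance] expander.monad

meta constant get_metavar       : syntax → expander expr
meta constant instantiate_mvars : expr → expander expr

meta constant syntax.rename_ident    : syntax → name → syntax
meta constant syntax.traverse_idents : (syntax → expander syntax) → syntax → expander syntax

-- rename every identifier node whose elaboration is the local constant `target`
meta def rename_local (target : expr) (new_name : name) : syntax → expander syntax :=
syntax.traverse_idents $ λ stx, do
  e ← get_metavar stx >>= instantiate_mvars,
  pure (if e = target then stx.rename_ident new_name else stx)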

gebner commented 7 years ago

It is also very useful to be able to single-step expansion

Does this work well with the sets-of-scopes approach? If I see it correctly, you could have an eta macro that expands as follows:

-- f : nat -> nat -> nat
λ x, eta (f x)
λ x, λ x, (f x) x

(Each of the x variables is in a different scope: one is in the macro's scope, the other one in the definition's lambda's. I hope I got this right.)

Clearly, the expanded version is not the right surface syntax. We also can't directly rename one of the x variables since we don't have binding information (yet). How do we pretty-print the expanded syntax trees?

david-christiansen commented 7 years ago

The method you mention for renaming seems equivalent to the Racket way of doing it, which is to stick a unique syntax property value on each identifier, expand it, and traverse the result with a parser to figure out which unique ID binds which. Storing a reference to the original syntax of each node in the resulting expr does basically the same thing. But this doesn't require instrumenting the expander.

To pretty print the expansion of the eta macro, use the scope sets to discover what refers to what, rename as needed to remove apparent conflicts, and then print the renamed version. Binding information is available post-expansion.

gebner commented 7 years ago

BTW, I read the paper "Binding as Sets of Scopes" (linked above), and I still don't understand why this approach is better than the locally nameless approach.

I think one advantage is that you can specify macros via just rewrite rules, even if they match on binders. In pseudo-Lean syntax:

-- not shown: parsing/printing hooks for `where` syntax, and macro boilerplate.
match syn with
| `````(%%a where %%b) := `````(let %%b in %%a)
end

example := n where n := 5
example := m where ⟨m,n⟩ := (⟨1,2⟩ : ℕ×ℕ) 

This specification also works transparently with destructuring assignments. With an explicit name resolution step, we would have to reimplement/reconstruct the bindings that let would create.

Storing a reference to the original syntax of each node in the resulting expr does basically the same thing. But this doesn't require instrumenting the expander.

As Leo remarked, this will not work (reliably). There are syntax objects that do not show up in the final expression (for example identifiers in binders, i.e. the first x in λ x, x+1). Besides, the elaboration/expander state would already contain all of that information, so it's just a matter of storing the elaboration state.

To pretty print, [..] use the scope sets [..]. Binding information is available post-expansion.

I feel like this may be a major problem if we view expansion as a transformation from syntax to expr. The main issue is that we need to execute tactics during expansion/elaboration. Fundamentally, I don't think there is any way to avoid tactic execution--the result of tactics affects the types of variables in the local context and the expected type. And we definitely want type-directed expansion: the ⟨_, _⟩ syntax can expand to any structure constructor (prod.mk, sigma.mk, etc.) depending on the expected type.

Tactic execution is relatively expensive (and potentially non-terminating), so I think we should not require it to print syntax objects.


Further points:

Expanding a primitive binder node (think of lambda or let) in the sets-of-scopes approach requires adding a new scope to all identifiers under that node. In contrast to locally nameless, I don't see any way to optimize this using variable bounds. I don't know if this would be a performance problem.

In Racket, there is presumably a good way to name global scopes (that is, what module you're in, what module imports and top-level definitions are visible, etc.). In Lean, we don't have that (and it would essentially require storing the environment at that point due to open, namespace, section, etc.). (I think storing these environments is a bad idea due to memory usage.) Hence syntax quotes referencing constants would probably be unhygienic. With an extra name resolution step, we could resolve the syntax quotes where they are defined, and this problem would not arise.

section
open nat  -- now bodd resolves to nat.bodd
def my_quote := `````(bodd)
end

-- inserting my_quote now would give an "unknown identifier" error,
-- unless we remember the environment my_quote was declared in.

leodemoura commented 7 years ago

@gebner Thanks for sending the last message. It is very useful.

After I read your message, I started to question the feasibility of syntax -> syntax transformations. Name resolution is also problematic before elaboration. We have aliases that are resolved using type information that is only available during elaboration.

import data.stream
open stream
open list
constant a : list nat
constant b : stream nat
#check cons 1 a  -- cons here is list.cons
#check cons 1 b -- cons here is stream.cons

In the current implementation, we use a choice node to encode overloaded names. We would need to use something similar to be able to resolve names before elaboration. So, any syntax -> syntax transformation would have to take these choice nodes into account. BTW, the reader will also have to introduce choice nodes for overloaded notation.
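
A sketch of what that could look like (constructor names made up):

meta constant syntax : Type
meta constant syntax.ident  : name → syntax
-- a choice node lists every candidate resolution of an overloaded name;
-- elaboration picks one later using type information
meta constant syntax.choice : list syntax → syntax

-- e.g. with both list and stream open, `cons` would be read as
-- syntax.choice [syntax.ident `list.cons, syntax.ident `stream.cons]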

As Leo remarked, this will not work (reliably). There are syntax objects that do not show up in the final expression (for example identifiers in binders, i.e. the first x in λ x, x+1). Besides, the elaboration/expander state would already contain all of that information, so it's just a matter of storing the elaboration state.

I agree, but what do we do then? We both agree this approach will not work reliably. Resolving names before elaboration and overloaded notation also seem to create problems for syntax -> syntax transformations.

BTW, I'm also concerned about printing syntax objects reliably after we apply transformations. Even a simple refactoring tool that just renames declarations will be tricky to make work reliably and produce nice results. A rename may introduce an unintended overloading due to aliasing, and the new syntax may fail to be elaborated. The refactoring tool would need a mechanism to decide when a fully qualified name is required or not. Some conservative solutions may be sufficient (e.g., it uses a fully qualified name if the short name is overloaded).

I think one advantage is that you can specify macros via just rewrite rules, even if they match on binders. In pseudo-Lean syntax:

I have a question about this approach. Suppose we define bor a b to be a macro defined as

(if x then x else b) where x := a

Now suppose we want to expand bor at let x := tt in bor ff x; we need to know that where is introducing a binding site. Otherwise, we will get the incorrect result let x := tt in (if x then x else x) where x := ff.

gebner commented 7 years ago

After I read your message, I started to question the feasibility of syntax -> syntax transformations.

So the alternative is to expand everything directly into expressions, without being able to use syntax as an intermediate step?

This means e.g. you cannot expand into do-notation or even >>=-applications. You would have to create the fully elaborated expressions directly (including monad type, intermediate result types, the monad instance, etc.) You'd also need to handle the propagation of the expected type manually.

Even if neither the LN nor the sets-of-scopes approach works out, we can always fall back to obscene amounts of mk_fresh_name.

Name resolution is also problematic before elaboration. We have aliases that are solved using type information that is only available during elaboration.

Yeah, we would need choice nodes in resolved syntax. The important part about resolved syntax is that local bindings are resolved, otherwise we can't do any transformations/printing at all.

BTW, the reader will also have to introduce choice nodes for overloaded notation.

My plan was to have (parser) notation as a separate node. That is, basically just store the tokens+expressions in a syntax node, and resolve them during elaboration. For pretty-printing we want the original (notation) structure anyhow.

Resolving names before elaboration and overloaded notation also seem to create problems for syntax -> syntax transformations.

So far, the name-resolution approach seems to be the least problematic IMO. I don't think choice nodes are a huge problem, there is tons of other syntax which resolves to different constants depending on the type (structures, calc, brackets, ...).

A rename may introduce an unintended overloading due to aliasing, and the new syntax may fail to be elaborated.

You mean, like renaming stream.cons to scons and then correctly renaming only half of the cons occurrences to scons? All refactoring tools have limits, everybody expects them to break in complicated situations. I think many refactoring tools don't even handle capture-avoiding renaming. Even IDEs such as IntelliJ will tell you "no, I can't do that" sometimes.

Practically speaking, we can improve the accuracy of such transformations by looking at the resulting expression after elaboration. IMHO the main purpose of the name resolution step should be to separate local bindings from global ones, and determine the local binding structure.

The refactoring tool would need a mechanism to decide when a fully qualified name is required or not.

Essentially, yes. I don't think it will be a big problem in practice, we can remember and restore the original constant name, and pray that it still works.

I think this part of the refactoring tool should be a general function that is also used to implement pp. That is, I think pp should be implemented in a few separate steps:

The refactoring tool would only use print ∘ unresolve, and that should be reasonably safe in most scenarios.

I have a question about this approach.

The trick is that each occurrence of x has a set of scopes attached, in this sense the two xs are different. Let's trace the calls to expand. After the first step, we invent the new scope 0 for the primitive binder let, add x^{0} to the local context, add the scope 0 to all identifiers in the body, and proceed with the subterm.

let x^{} := tt in bor ff x^{}
bor ff x^{0}
(if x^{1} then x^{1} else x^{0}) where x^{1} := ff
let x^{1} := ff in if x^{1} then x^{1} else x^{0}
if x^{1,2} then x^{1,2} else x^{0,2}

If I read the paper correctly, then the x^{0,2} at the end resolves to x^{0} instead of x^{1}, because {0} is the (largest) subset of {0,2}.

leodemoura commented 7 years ago

So the alternative is to expand everything directly into expressions, without being to use syntax as an intermediate step?

It is not clear to me what a good alternative is. It seems each approach has problems/limitations. I'm looking forward to meeting you in Cambridge and discussing these issues in person.

You mean, like renaming stream.cons to scons and then correctly renaming only half of the cons occurrences to scons?

Yes, I'm thinking of examples like this. I'm also considering the case where scons is aliased. That is, there is a foo.scons, and we have used open foo.

All refactoring tools have limits, everybody expects them to break in complicated situations. I think many refactoring tools don't even handle capture-avoiding renaming. Even IDEs such as IntelliJ will tell you "no, I can't do that" sometimes.

I'm ok if the refactoring tool has limitations, but I want to understand better what we will not be able to do before we decide which approach we will use.

BTW, I'm also concerned with hole commands. After this refactoring, they will return syntax objects representing different ways of filling a hole. Some of these fragments will fail to elaborate due to overloading, missing implicit arguments, etc. At least, in the new approach, we can try to re-elaborate these fragments before we send them to the user.

The trick is that each occurrence of x has a set of scopes attached, in this sense the two xs are different. Let's trace the calls to expand. After the first step, we invent the new scope 0 for the primitive binder let, add x^{0} to the local context, add the scope 0 to all identifiers in the body, and proceed with the subterm.

How do we know that the where is creating a scope? That is, when we go from

bor ff x^{0}

to

(if x^{1} then x^{1} else x^{0}) where x^{1} := ff

gebner commented 7 years ago

How do we know that the where is creating a scope?

Let me explain where each of the scopes comes from / what they mean:

(Maybe there should be another scope for the current definition, the paper is not too clear on the total number of scopes in the actual implementation.)

The beauty of the SOS approach is that expanding where does not introduce a scope. We never need to know the binding structure of macros, we just expand blindly. The scope only gets introduced when we process the let.

Note that this only works because we never look at subterms of unexpanded terms. Expansion only happens at the top level; we only proceed to subterms of primitive constructs (let, lambda, etc.). There we know the binding structure and introduce scopes as needed.

gebner commented 7 years ago

Some idle thoughts:

syntax.node should take a tag (of type nat) just like expr does now. Otherwise it becomes difficult to trace syntax objects through the elaboration process. (Just think of my example query: what is the type after elaboration? If the syntax object is under a binder, we'll instantiate it with different local constants (or scopes) every time, so we can't just use the syntax object itself as an identifier.)

After listening to Jeremy's talk today I remembered that we have another source of unhygiene, namely unresolved expressions in tactics:

-- say bor is a macro such that:
-- bor a b   ==>   let x := a in if x then x else b

example (x : nat) := bor ff (by exact x > 0)

-- expands to:
example (x : nat) :=
let x := ff in if x then x else (by exact x > 0)
                          -- ideally this ^ would still resolve to the parameter

leodemoura commented 7 years ago

example (x : nat) :=
let x := ff in if x then x else (by exact x > 0)
                          -- ideally this ^ would still resolve to the parameter

Right now, we elaborate x > 0 with respect to the current local context at tactic execution time. Moreover, name resolution is performed using user-facing names. So x will refer to x := ff :( One possible hack is to make sure that binders introduced by macro expansion use a disjoint set of user-facing names (e.g., they have the prefix _m).

What do you think? Do you have cleaner solutions in mind? Recall that the following weird example is expected to work (and it does in the current version).

example (x : nat) (y : bool) : bool :=
begin
  rename y x,
  exact bnot x -- works, x is now referring to `y : bool`.
end

gebner commented 7 years ago

One possible hack is to make sure that binders introduced by macro expansion use a disjoint set of user-facing names (e.g., they have the prefix _m).

That's exactly what I had in mind. We already use underscore as a prefix for such private names, and I think that one way or another, macro expansions should just get their bound names prefixed by an underscore. I see two possible points where this can happen:

When constructing syntax objects manually, the underscores need to be added manually.

The underscores pose no problem for pretty-printing (e.g., when expanding a macro in the editor): the unresolve function can remove the underscore to produce names that we can parse again.


Another option is to remember the local context of the begin-end block during name resolution. At tactic runtime we would prefix all user-facing names of local constants that were not present during name resolution with an underscore.

(A similar approach is possible with SOS: we would remember the scope set of the begin-end block during expansion, and then use the same scope set when expanding x > 0)

gebner commented 7 years ago

By popular demand over dinner: here is the mockup code.