leanprover / lean4

Lean 4 programming language and theorem prover

Faster reduction and conversion #2163

Open · Kha opened this issue 1 year ago

Kha commented 1 year ago

We have talked about improving the performance of reduction in the kernel from time to time, so it's probably a good idea to start writing down the issues and ideas we have. I have even heard we may have someone who wants to work on this topic this year!

For starters, let us define the problem more precisely: conversion checking, i.e. checking definitional equality, is a core operation of the kernel, and it relies on reducing terms to weak head normal form. When the involved terms may also contain metavariables, we instead talk about unification, which has performance issues of its own, so optimizing it may be better left to a separate issue. Currently, the Lean elaborator, written in Lean, always uses a unification algorithm, while the kernel, written in C++, uses a separate conversion-checking implementation that is more or less the same algorithm without support for metavariables.
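To make the two code paths concrete, here is a minimal metaprogramming sketch (using only the public `Lean.Meta` API) that poses a closed, mvar-free conversion problem to the elaborator's unifier; the kernel answers the same question with its separate C++ implementation when the final proof term is checked.

```lean
import Lean
open Lean Meta

-- A closed conversion problem: is `Nat.add 2 2` definitionally equal to `4`?
-- `isDefEq` decides this in the elaborator via unification; the kernel's
-- conversion checker settles the same question by reducing both sides to
-- weak head normal form.
#eval show MetaM Unit from do
  let lhs ← mkAppM ``Nat.add #[mkNatLit 2, mkNatLit 2]
  let ok ← isDefEq lhs (mkNatLit 4)
  IO.println ok  -- prints `true`
```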

In current Lean code, conversion checking is rarely a bottleneck: at the time of writing, about 2.8% of the time spent building mathlib4 goes to any part of the kernel; how much time the elaborator spends unifying mvar-free terms is, however, unknown. What the current performance of conversion mostly prevents us from doing is exploring new applications with big conversion problems that never went through unification, such as proof by reflection. Indeed, @amahboubi pointed me to the paper "Formalized Class Group Computations and Integral Points on Mordell Elliptic Curves" by @Vierkantor et al., where they state about the design of a tactic:

In addition to computing a normal form, the tactic also assembles a proof that the input expression is equal to its normal form. Constructing a proof term tends to be faster in Lean 3 than proof by reflection, since the interpreter for metaprograms is usually more efficient than kernel reduction.
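The trade-off is easy to reproduce in Lean 4 with nothing beyond the core `decide` tactic; this small example is the reflective style whose cost is dominated by kernel reduction:

```lean
-- Proof by reflection in miniature: `decide` elaborates to the constant-size
-- term `of_decide_eq_true (Eq.refl true)`, so the kernel has to *evaluate*
-- `decide (200 * 200 = 40000)` all the way down to `true`.  This is exactly
-- the kind of closed conversion problem that never goes through unification.
example : 200 * 200 = 40000 := by decide
```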

The use of proof by reflection is also well-established in the Coq world (Grégoire & Leroy: A Compiled Implementation of Strong Reduction):

A typical example is the use of computations on binary decision diagrams to prove results on boolean formulas [23]. Another example is Appel and Haken’s famous proof of the 4-color theorem [2], which involves checking colorability of a finite, but large, number of well-chosen planar graphs: rather than developing a proof of 4-colorability for every such graph, a proved decision procedure is invoked on all of them.

Prior Work

Probably the system with the most existing work on speeding up conversion is Coq, with as many as three different reduction engines in the kernel, triggered by annotations in the core term: the default lazy abstract machine, a bytecode VM (vm_compute), and compilation to native code (native_compute).

The post "Fast reduction tactics: vm_compute and native_compute" describes how one should decide between the latter two options, given their different startup vs. runtime performance.

The unifier can switch to kernel reduction when encountering closed terms (though this is not always faster).

@AndrasKovacs' normalization-bench tests native_compute-like reduction across various runtimes. smalltt documents and implements many ideas on fast conversion (and elaboration!) along with extensive cross-system benchmarks:

Importantly, we can make do with a single evaluator for all purposes, with fairly good performance. In contrast, Agda, Coq and Lean all have multiple evaluators, and in all cases only the slowest evaluators can be used without restrictions during elaboration. As we'll see in benchmarks, smalltt is robustly faster than all "slow" evaluators, and can be faster or slower than the Coq bytecode VM depending on workloads.

Finally, we do have the special functions reduceBool and reduceNat in Lean 4, which use interpreted and/or native code depending on precompilation, but these are limited to their respective types and to closed terms.
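As a usage sketch: the `native_decide` tactic is built on reduceBool, so it trades kernel reduction for a single trusted run of compiled (or interpreted) code; the flip side is that the compiler becomes part of the trusted code base.

```lean
-- `native_decide` discharges a closed decidable proposition via
-- `Lean.reduceBool`: the decision procedure is run as compiled/interpreted
-- code and only the resulting `Bool` is trusted, instead of being unfolded
-- step by step by the kernel's reducer.
example : (List.range 1000).foldl (· + ·) 0 = 499500 := by native_decide
```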

Kha commented 1 year ago

As for ideas, there is a picture that has burned itself into my mind regarding the overhead of interpretation vs. native computation:

[figure: interpretation vs. native computation overhead]

The figure is from "Copy-and-patch compilation: a fast compilation algorithm for high-level languages and bytecode" by Xu and Kjolstad, which I learned about on Mastodon via @julesjacobs. It is not clear how practical the described approach is to implement, given that it does not seem to have been replicated outside of a single citing paper so far, but I don't see why it should not be applicable to this topic as well, with hopefully similar results.