leanprover / lean4

Lean 4 programming language and theorem prover

Faster reduction and conversion #2163

Open · Kha opened this issue 1 year ago

Kha commented 1 year ago

We have talked about improving the performance of reduction in the kernel from time to time, so it's probably a good idea to start writing down the issues and ideas we have. I have even heard we may have someone who wants to work on this topic this year!

For starters, let us define the problem more precisely: conversion checking, i.e. checking definitional equality, is a core operation of the kernel, and it relies on reducing terms to weak head normal form. When the involved terms may also contain metavariables, we instead talk about unification, which has performance issues of its own, so optimizing it may be better left to a separate issue. Currently, the Lean elaborator, written in Lean, always uses a unification algorithm, while the kernel, written in C++, uses a separate conversion-checking implementation that is more or less the same algorithm without support for metavariables.
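To make the two code paths concrete, here is a minimal metaprogramming sketch (using only the public `Lean.Meta` API) that poses a closed, mvar-free conversion problem to the elaborator's unifier; the kernel answers the same question with its separate C++ implementation when the final proof term is checked.

```lean
import Lean
open Lean Meta

-- A closed conversion problem: is `Nat.add 2 2` definitionally equal to `4`?
-- `isDefEq` decides this in the elaborator via unification; the kernel's
-- conversion checker settles the same question by reducing both sides to
-- weak head normal form.
#eval show MetaM Unit from do
  let lhs ← mkAppM ``Nat.add #[mkNatLit 2, mkNatLit 2]
  let ok ← isDefEq lhs (mkNatLit 4)
  IO.println ok  -- prints `true`
```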

In current Lean code, conversion checking is rarely a bottleneck: at the time of writing, about 2.8% of the time spent building mathlib4 goes to any part of the kernel; how much time the elaborator spends unifying mvar-free terms is, however, unknown. What the current performance of conversion mostly prevents us from doing is exploring new applications with big conversion problems that never went through unification, such as proof by reflection. Indeed, @amahboubi pointed me to the paper "Formalized Class Group Computations and Integral Points on Mordell Elliptic Curves" by @Vierkantor et al., where they state about the design of a tactic:

In addition to computing a normal form, the tactic also assembles a proof that the input expression is equal to its normal form. Constructing a proof term tends to be faster in Lean 3 than proof by reflection, since the interpreter for metaprograms is usually more efficient than kernel reduction.
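The trade-off is easy to reproduce in Lean 4 with nothing beyond the core `decide` tactic; this small example is the reflective style whose cost is dominated by kernel reduction:

```lean
-- Proof by reflection in miniature: `decide` elaborates to the constant-size
-- term `of_decide_eq_true (Eq.refl true)`, so the kernel has to *evaluate*
-- `decide (200 * 200 = 40000)` all the way down to `true`.  This is exactly
-- the kind of closed conversion problem that never goes through unification.
example : 200 * 200 = 40000 := by decide
```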

The use of proof by reflection is also well-established in the Coq world (Grégoire & Leroy: A Compiled Implementation of Strong Reduction):

A typical example is the use of computations on binary decision diagrams to prove results on boolean formulas [23]. Another example is Appel and Haken’s famous proof of the 4-color theorem [2], which involves checking colorability of a finite, but large, number of well-chosen planar graphs: rather than developing a proof of 4-colorability for every such graph, a proved decision procedure is invoked on all of them.

Prior Work

Probably the system with the most existing work on speeding up conversion is Coq, with as many as three different reduction engines in the kernel, triggered by annotations in the core term: the default lazy abstract machine, a bytecode VM (vm_compute), and compilation to native code (native_compute).

The post "Fast reduction tactics: vm_compute and native_compute" describes how one should decide between the latter two options, given their different startup vs. runtime performance.

The unifier can switch to kernel reduction when encountering closed terms (though this is not always faster).

@AndrasKovacs' normalization-bench tests native_compute-like reduction across various runtimes. smalltt documents and implements many ideas on fast conversion (and elaboration!) along with extensive cross-system benchmarks:

Importantly, we can make do with a single evaluator for all purposes, with fairly good performance. In contrast, Agda, Coq and Lean all have multiple evaluators, and in all cases only the slowest evaluators can be used without restrictions during elaboration. As we'll see in benchmarks, smalltt is robustly faster than all "slow" evaluators, and can be faster or slower than the Coq bytecode VM depending on workloads.

Finally, we do have the special functions reduceBool and reduceNat in Lean 4, which use interpreted and/or native code depending on precompilation, but these are limited to their respective types and to closed terms.
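As a usage sketch: the `native_decide` tactic is built on reduceBool, so it trades kernel reduction for a single trusted run of compiled (or interpreted) code; the flip side is that the compiler becomes part of the trusted code base.

```lean
-- `native_decide` discharges a closed decidable proposition via
-- `Lean.reduceBool`: the decision procedure is run as compiled/interpreted
-- code and only the resulting `Bool` is trusted, instead of being unfolded
-- step by step by the kernel's reducer.
example : (List.range 1000).foldl (· + ·) 0 = 499500 := by native_decide
```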

Kha commented 1 year ago

As for ideas, there is a picture that has burned itself into my mind regarding the overhead of interpretation vs. native computation:

[figure: interpretation vs. native computation overhead]

The figure is from "Copy-and-patch compilation: a fast compilation algorithm for high-level languages and bytecode" by Xu and Kjolstad, which I learned about on Mastodon via @julesjacobs. It is not clear how practical the described approach is to implement, given that it does not seem to have been replicated outside of a single citing paper so far, but I don't see why it should not be applicable to this topic as well, with hopefully similar results.