Kha opened this issue 1 year ago
As for ideas, there is a figure that has burned itself into my mind regarding interpretation vs. native computation overhead: it is from "Copy-and-patch compilation: a fast compilation algorithm for high-level languages and bytecode" by Xu and Kjolstad, which I learned about on Mastodon via @julesjacobs. It is not clear how practical the described approach is to implement, given that it does not seem to have been replicated so far outside of one citing paper, but I don't see why it should not be applicable to this topic as well, with hopefully similar results.
We have talked about improving the performance of reduction in the kernel from time to time, so it's probably a good idea to start writing down the issues and ideas we have. I even heard we may have someone who wants to work on this topic this year!
For starters, let us define the problem more precisely: conversion checking, i.e. checking definitional equality, is a core operation of the kernel, and it relies on reducing terms to weak head normal form. When the involved terms may also contain metavariables, we instead talk about unification, which has its own performance issues, such that optimizing it may be better left to a separate issue. Currently, the Lean elaborator written in Lean will always use a unification algorithm, while the kernel written in C++ uses a separate conversion checking implementation that is more or less the same algorithm without support for metavariables.
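To make the operation concrete, here is the simplest possible example of kernel conversion checking at work: `rfl` type-checks only because both sides reduce to the same normal form.

```lean
-- The kernel accepts `rfl` here because `2 + 2` and `4` are definitionally
-- equal: weak head normal form reduction brings both sides to the same term.
example : 2 + 2 = 4 := rfl
```

Scaling this pattern up to large closed computations is exactly where reduction performance starts to matter.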
In current Lean code, conversion checking is rarely a bottleneck: at the time of writing, about 2.8% of building mathlib4 is spent in any part of the kernel; how much time is spent on unifying mvar-free terms in the elaborator is, however, unknown. What the current performance of conversion checking mostly prevents is exploring new applications with big conversion problems that never went through unification, such as proof by reflection. Indeed, @amahboubi pointed me to the paper "Formalized Class Group Computations and Integral Points on Mordell Elliptic Curves" by @Vierkantor et al., where they discuss the design of a tactic.
The use of proof by reflection is also well-established in the Coq world (Grégoire & Leroy: "A Compiled Implementation of Strong Reduction").
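For illustration, proof by reflection in Lean typically goes through `decide`: the proposition is replaced by a decision procedure, and the kernel must then evaluate that procedure to `true` by reduction, which is where reduction performance becomes the limiting factor.

```lean
-- Proof by reflection via `decide`: the proof term is essentially
-- `of_decide_eq_true rfl`, so the kernel has to reduce the closed term
-- `decide (2 ^ 10 % 3 = 1)` all the way to `true`.
example : 2 ^ 10 % 3 = 1 := by decide
```

For small goals like this the cost is negligible; the interesting applications are those where the decision procedure performs substantial computation.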
Prior Work
Probably the system with the most existing work on speeding up conversion is Coq, with as many as three different reduction engines in the kernel, triggered by annotations in the core term:
- `vm_compute` uses an "optimized call-by-value evaluation bytecode-based virtual machine", as in (untyped) normalization by evaluation.
- `native_compute` improves upon the former by compiling to native code. The paper emphasizes that this is done using the vanilla OCaml compiler, not a custom abstract machine as before, without introducing additional overhead, by (ab)using OCaml's existing runtime tagging. As our object layout is very similar and we have full control over the runtime, this should be possible for us as well. "Fast reduction tactics: `vm_compute` and `native_compute`" describes how one should decide between these two options given their different startup vs. runtime performance.
- The unifier can switch to kernel reduction when encountering closed terms (though this is not always faster).
- @AndrasKovacs' normalization-bench tests `native_compute`-like reduction across various runtimes.
- smalltt documents and implements many ideas on fast conversion (and elaboration!), along with extensive cross-system benchmarks.
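The normalization-by-evaluation idea behind `vm_compute` can be sketched in a few lines: instead of rewriting syntax, terms are evaluated into a semantic domain of closures and neutral values, and normal forms are read back afterwards. Below is a minimal untyped λ-calculus version in Lean; the `Term`/`Value` types are toys for illustration, not the kernel's actual representation.

```lean
-- Toy untyped NbE sketch (illustration only, not kernel code).
inductive Term where
  | var (i : Nat)      -- de Bruijn index
  | lam (b : Term)
  | app (f a : Term)
deriving Repr

inductive Value where
  | vlam (env : List Value) (b : Term)    -- closure
  | vneu (lvl : Nat) (args : List Value)  -- stuck variable (de Bruijn level) + spine

partial def eval (env : List Value) : Term → Value
  | .var i   => env.getD i (.vneu 0 [])  -- fallback never hit for well-scoped terms
  | .lam b   => .vlam env b
  | .app f a =>
    match eval env f with
    | .vlam env' b => eval (eval env a :: env') b        -- β-reduce via the closure
    | .vneu l sp   => .vneu l (sp ++ [eval env a])       -- extend the neutral spine

-- Read a value back into a β-normal term; `d` is the current binder depth.
partial def quote (d : Nat) : Value → Term
  | .vlam env b => .lam (quote (d + 1) (eval (.vneu d [] :: env) b))
  | .vneu l sp  => sp.foldl (fun t v => .app t (quote d v)) (.var (d - l - 1))

def nf (t : Term) : Term := quote 0 (eval [] t)

-- nf ((λx. x) (λx. x)) reduces to λx. x:
#eval nf (.app (.lam (.var 0)) (.lam (.var 0)))
```

The point of the technique is that `eval` never substitutes into syntax, so the expensive part of reduction becomes ordinary function calls, which a bytecode VM or native compiler can then execute directly.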
Finally, we do have the special functions `reduceBool` and `reduceNat` in Lean 4 that use interpreted and/or native code depending on precompilation, but these are limited to the respective types and to closed terms.
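For reference, the usual entry point to this machinery is the `native_decide` tactic, which goes through `reduceBool`: the decision procedure runs as compiled (or interpreted) code, and the kernel trusts the resulting `Bool` rather than reducing the term itself.

```lean
-- `native_decide` evaluates `decide (2 ^ 64 % 7 = 2)` via `Lean.reduceBool`,
-- i.e. as compiled/interpreted code, instead of kernel reduction. This grows
-- the trusted code base to include the compiler.
example : 2 ^ 64 % 7 = 2 := by native_decide
```

This also illustrates the trade-off mentioned above: the speedup comes at the price of trusting the evaluator, whereas a faster kernel reduction engine would keep the proof checked by the kernel itself.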