leanprover / lean4

Lean 4 programming language and theorem prover
https://lean-lang.org
Apache License 2.0

Optimizing terms for kernel reduction (meta issue) #5806

Open nomeata opened 2 weeks ago

nomeata commented 2 weeks ago

This is a meta-issue that I’ll use to track my progress in creating a toolkit for faster kernel reduction. This is to help me organize the work, but also to provide more visibility to interested members of the community.

Background

Lean has two mechanisms for computations in proofs: by decide, which uses Kernel.whnf, and by native_decide, which uses the Lean compiler and then runs native code. The former tends to be slow; the latter has a significantly larger TCB. I investigated ways to find a middle ground here: something that allows larger computations than Kernel.whnf handles well, but with a noticeably smaller TCB than native_decide.
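As a concrete illustration of the two existing mechanisms (the proposition is an arbitrary small example, not from the issue):

```lean
-- `decide` evaluates the `Decidable` instance inside the kernel via
-- `Kernel.whnf`: slow for large computations, but the trusted code base
-- stays small.
example : 2 ^ 16 % 7 = 2 := by decide

-- `native_decide` compiles the decision procedure and runs it as native
-- code: fast, but the compiler and its runtime become part of the TCB.
example : 2 ^ 16 % 7 = 2 := by native_decide
```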

While a faster Kernel.whnf is possible by using a different evaluator (e.g. something like Sestoft's abstract machine) and would give a speed-up of 2–3×, this does not quite justify the increase in kernel code. Other ideas, such as certified compilation to some lower-level VM that is interpreted by the kernel, are appealing, but too large a task to tackle right away.

In these investigations I noticed that the code we send to the kernel is not particularly optimized (recursive functions use complex encodings, numerical operations are overloaded, and computations on subtypes like Fin perform a lot of wrapping and unwrapping). A quick experiment shows that a 10× speed-up is possible in some cases.
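To make the overhead tangible, here is a hedged sketch of where it comes from (these are standard Lean definitional facts, used only as illustration; the actual optimizations in the toolkit may look quite different):

```lean
-- Overloading: an addition on `Nat` reaches the kernel as the type-class
-- projection `HAdd.hAdd`, which must be unfolded to `Nat.add` at every
-- step of reduction. The two are definitionally equal:
example (a b : Nat) : a + b = Nat.add a b := rfl

-- Wrapping/unwrapping: an addition on `Fin n` additionally wraps the
-- underlying `Nat` result in `Fin.mk` after a `% n`, so each operation
-- on the subtype pays for projections, a modulo, and re-packaging --
-- work that an optimized term could avoid or batch.
```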

However, just applying this trick broadly can lead to an overall slow-down, because for small calculations it is faster to simply reduce the term than to optimize it first.

Goal

Therefore the goal of this project is to create a toolkit that can be used by advanced users to carefully optimize a specific expensive proof, or to implement a particular proof-by-reflection tactic.

Design

The idea is to have a by rsimp_decide tactic that uses the simplifier (a versatile tool for certified rewriting of terms) with a custom simpset (rsimp) to rewrite the expression before handing it to the kernel for evaluation.
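Since the tactic is still being designed, the following is only a hypothetical sketch of the intended interaction; the `rsimp` attribute and the `rsimp_decide` syntax are the proposal of this issue, not shipped features:

```lean
-- Hypothetical: a user marks a rewrite as beneficial for kernel
-- reduction, e.g. replacing a function by a reduction-friendly variant
-- together with a proof that the two agree:
-- @[rsimp] theorem foo_eq_fastFoo : foo = fastFoo := ...

-- Hypothetical: `rsimp_decide` would then rewrite the goal's `Decidable`
-- instance with the `rsimp` simpset (a certified rewrite, via the
-- simplifier) and pass the optimized term to the kernel, as `decide`
-- does today:
-- example : someExpensiveProp := by rsimp_decide
```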

Component overview

(These are rough specifications, some syntax or interaction will become clearer as I build these things.)

Evaluation

Once things are in place, it would be good to see that it can be applied successfully to

Command-Master commented 6 days ago

For some cases, the computation can be verified more quickly than it can be performed. For example, in Nat.decidableExistsLE, from what I understand the kernel currently checks all values until one works, but optimally the value which satisfies the predicate can be found in native code and then given to the kernel. Other examples of this phenomenon (although they require more mathematics) are computing Nat.gcd a b, where native code can compute values x, y, z, w such that z * (x * a - y * b) = a and w * (x * a - y * b) = b, so that the gcd is x * a - y * b; or factorization, where native code can factor the number and produce primality certificates.
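The existential case can be illustrated in plain Lean (the proposition and witness are made up for illustration; the point is that the witness could come from untrusted native code):

```lean
-- Proving this with `decide` would make the kernel try candidate values
-- one by one. If untrusted native code has already found the witness
-- `42`, the kernel only has to check that single instance:
example : ∃ n, n ≤ 100 ∧ 10 * n = 420 :=
  ⟨42, by decide⟩
```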

Command-Master commented 6 days ago

There was also some discussion about this topic in https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Large-ish.20computation.20without.20Lean.2EofReduceBool . Is there some way to profile kernel reduction, to understand where the term could be optimized?

nomeata commented 5 days ago

Absolutely true that often you can split the computation into unverified data generation and verified checking; that is not the topic of this issue, though.

Better profiling for kernel reduction would be great, and I'm missing it too, but it isn't really available right now.