Metavariable instantiation seems to slow down non-linearly in the size of the proof

Prerequisites

Please put an X between the brackets as you perform the following steps:

[X] Check that your issue is not already filed: https://github.com/leanprover/lean4/issues
[ ] Reduce the issue to a minimal, self-contained, reproducible test case. Avoid dependencies to Mathlib or Batteries.
- This is rather hard, since the problem is one of scalability.
[ ] Test your test case against the latest nightly release, for example on https://live.lean-lang.org/#project=lean-nightly (You can also use the settings there to switch to “Lean nightly”)
- Again, hard due to the lack of an MWE

Description

In (LNSym)[github.com/leanprover/LNSym] we produce quite large proofs, which seem to be bottlenecked by metavariable instantiation: in our most extreme benchmark, elaborating a theorem takes about 18 seconds in total, of which 10 seconds are spent in instantiateMVars.

Firstly, this time was not showing up in the trace.profiler output, but this is fixed in #5501.

The bigger question, is why is metavariable instantiation so slow? Here's a table of some benchmarks we've run, with links to the source code and the resulting profile.

Benchmark	Profile	Elaboration Time (total)	`instantiateMVars`
SHA512_150	https://share.firefox.dev/4dsm3Is	3.9s	1.3s (34%)
SHA512_225	https://share.firefox.dev/4dsm3Is	6.9s	3s (44%)
SHA512_400	https://share.firefox.dev/4ev6y3E	17.7s	10s (56%)

To be clear: the instantiateMVars runtime reported in the table measures only the calls done in the MutualDef elaborator.

Context

To investigate the issue, we've experimented with a wrapper that remembers the goal metavariable at the start, runs the tactic, and then at the end reassigns the metavariable to it's instantiation (assuming that it's been assigned) --- see here for where the wrapper is used.

Surprisingly this had almost no effect on the time spent on metavariable instantiation in the elaborator code. The benchmarks table above use the commit which has this wrapper. To see the time taken, expand the tactic nodes until you see Tactic.sym: withInstantiatMainGoal.instantiateMVar and observe that even in the largest benchmark, instantiating mvars here takes only 5ms, as opposed to the 10 seconds spent in the elaborator.

Also, to be clear: these profiles were obtained by running Lean from the terminal, but running them in interactively in Lean gives roughly the same timings: this issue seems orthogonal from what's going on in #5614.

Steps to Reproduce

Clone LNSym
Checkout commit 8384b193b4c727da83b200f5132c7b4620d3b03d to get the experimental wrapper
Run make profiles to profile the benchmarks
You'll find the generated profile json files under data/profiles

If you're not running a toolchain with #5501 cherry-picked on it, you can deduce how much time is spent on instantiating mvars by looking at the self-time of the Elab.command node.

Expected behavior: I wouldn't expect the majority of elaboration time to be spent on instantiating mvars.

Actual behavior: The time spent on instantiating mvars seems to scale non-linearly with the size of the proof (or at least, it scales non-linearly with the time spent on evaluating the tactic that produced the proof), even if we force the tactic to fully instantiate it's original main goal.

Versions

nightly-2024-09-10, with #5501 cherry-picked on it (specifically, opencompl/fix-timeleak-nightly-2024-09-10)

Additional Information

[Additional information, configuration or data that might be necessary to reproduce the issue]

Impact

Add :+1: to issues you consider important. If others are impacted by this issue, please ask them to add :+1: to it.

It would be very helpful to get an idea what causes instantiateMVars to become so slow. For another datapoint, consider https://github.com/leanprover/LNSym/pull/219. This PR replaces a use of evalTactic with a direct implementation using Meta API. Compare a profile from before the change with a profile of the same file afterwards.

You'll notice the former takes 8 seconds in total, of which 1.2 is spent in "Elab.step: tacticInit_next_step_" (init_next_step is the tactic that was being evalTacticd). In the second profile, initNextStep (the replacement) doesn't even make the profiler's threshold, but we can see that we spend much less time executing the tactic (the parent "Elab.step: tacticSymn" node was about 3.3 seconds before, and is 1.1 seconds in the second profile; a difference of exactly 1.2 seconds).

However, all of that is lost on the metavariable instantiation. This profile was made with a toolchain version that doesn't have a node for instantiateMVar, but we can deduce how much is spent on it from the self time of Elab.command: this was 3.8 seconds in the first profile, and 11.7 seconds in the second.

Any hints as to what could've caused such a dramatic increase in the time taken to instantiate metavars is appreciated!

Unfortunately, the slowdown seems to not be explained just by proof sizes. In fact, I ran some experiments on https://github.com/leanprover/LNSym/pull/215, which implements a change that causes a pretty big slowdown in the time taken for instantiateMVars, but when I measured the size of the proofs, those generated after the change seemed to be smaller, despite being slower.

With the caveat that the slowdown was measured by symbolically simulating at least 50 steps of SHA512, but I wasn't able to get numbers for proof sizes for more than 7 steps, so it is possible that somehow for the first 7 steps, the changed version generated a smaller proof, but somehow for 50 steps the proof becomes larger.

I’ve identified the performance issue. The tactic produces a long sequence of nested, delayed-assigned metavariables. I confirmed that the sequence length exceeds 4,000.

Here’s a simplified/abstracted version of what’s happening:

It begins with a term like ?m_1 a_1, where ?m_1 has a delayed assignment of the form:

?m_1 #[x_1] := ... (?m_2 a_2) ...

where ?m_2 also has a delayed assignment of the form:

?m_2 #[x_2] := ... (?m_3 a_3) ...

and this continues until we reach:

?m_4000 #[x_4000] := ... -- no metavariable here

The current implementation constructs the resulting term from the bottom up, starting with ?m_4000.

Furthermore, at each step, replace_fvars is used to replace occurrences of x_i with a_i after replacing ?m_i with its assignment. This process gradually builds a larger and larger term, and as a result, replace_fvars becomes increasingly expensive.

I tried two different approaches, but neither produced significant performance improvements. I am now considering special treatment for proofs. For example, skipping the beta-reduction step here significantly improves runtime (by about 50%) since it reduces the number of terms containing free variables. However, this is not a robust solution.

I also considered creating auxiliary lemmas, but it's messy. We would need all the machinery for MetaM, and there's no guarantee that it would solve the performance issue after we add all this complexity. For using MetaM, we need to avoid loose bound variables.

Another option I am considering is using a delayed substitution auxiliary term within instantiate_mvars. Before returning the result, we would perform a single traversal to eliminate all them. Not sure it will solve all issues because update the metavar context assignment, and these delayed substitution terms should not leak into the assigment.

Thanks for the investigation @leodemoura!

Where exactly do these delayed mvars get created? I originally thought the various Meta APIs that we use create regular assignments, but from your diagnosis I assume those can introduce delayed assignments also. Is there a way for us to work around the problem by avoiding such delayed assignments altogether?

leanprover / lean4