microsoft / Nova

Nova: High-speed recursive arguments from folding schemes

Streamline Nova allocations (Arecibo backport) #277

Open huitseeker opened 7 months ago

huitseeker commented 7 months ago

What

This backports the following PRs in Arecibo:

- https://github.com/lurk-lab/arecibo/pull/139
- https://github.com/lurk-lab/arecibo/pull/143
- https://github.com/lurk-lab/arecibo/pull/144

Why

These PRs streamline the memory allocations occurring in performance-sensitive parts of Nova by applying a few simple techniques: removing per-step R1CS clones, preallocating witness buffers, and reusing large vector allocations (each detailed in the breakdown further down the thread).

Outcome

Each PR (in Arecibo; each corresponds to a commit here) has been benchmarked on Lurk-size circuits to show its impact on large witnesses / instances, and each shows a 5-8% improvement in proving latency.

Local benchmarks (small circuits, no memory pressure, high variance)

critcmp results (meant to demonstrate this does not make things worse from a CPU PoV):

```
group                                                      main                   nova_allocations
-----                                                      ----                   ----------------
CompressedSNARK-Commitments-StepCircuitSize-0/Prove        1.00     7.3±0.08s     1.00     7.3±0.12s
CompressedSNARK-Commitments-StepCircuitSize-0/Verify       1.00   188.3±5.29ms    1.02   191.5±4.17ms
CompressedSNARK-Commitments-StepCircuitSize-121253/Prove   1.01    23.3±0.58s     1.00    23.1±0.20s
CompressedSNARK-Commitments-StepCircuitSize-121253/Verify  1.05   728.1±60.72ms   1.00   694.4±20.50ms
CompressedSNARK-Commitments-StepCircuitSize-22949/Prove    1.00    11.6±0.16s     1.00    11.6±0.16s
CompressedSNARK-Commitments-StepCircuitSize-22949/Verify   1.07   333.4±31.70ms   1.00   312.0±10.16ms
CompressedSNARK-Commitments-StepCircuitSize-252325/Prove   1.00    43.4±0.87s     1.01    43.9±0.78s
CompressedSNARK-Commitments-StepCircuitSize-252325/Verify  1.03  1213.0±107.21ms  1.00  1173.6±48.71ms
CompressedSNARK-Commitments-StepCircuitSize-55717/Prove    1.00    19.7±0.35s     1.00    19.8±0.43s
CompressedSNARK-Commitments-StepCircuitSize-55717/Verify   1.00   599.4±29.04ms   1.00   599.9±28.39ms
CompressedSNARK-Commitments-StepCircuitSize-6565/Prove     1.00    10.8±0.13s     1.00    10.7±0.09s
CompressedSNARK-Commitments-StepCircuitSize-6565/Verify    1.04   302.6±10.64ms   1.00   291.6±10.66ms
CompressedSNARK-StepCircuitSize-0/Prove                    1.01   799.1±26.03ms   1.00   791.7±10.32ms
CompressedSNARK-StepCircuitSize-0/Verify                   1.02    31.3±1.15ms    1.00    30.7±0.61ms
CompressedSNARK-StepCircuitSize-1038757/Prove              1.00    48.6±0.76s     1.00    48.4±1.05s
CompressedSNARK-StepCircuitSize-1038757/Verify             1.00  1599.8±38.46ms   1.01  1622.6±36.11ms
CompressedSNARK-StepCircuitSize-121253/Prove               1.00     6.5±0.27s     1.01     6.6±0.15s
CompressedSNARK-StepCircuitSize-121253/Verify              1.02   210.2±12.48ms   1.00   206.0±12.69ms
CompressedSNARK-StepCircuitSize-22949/Prove                1.00  1865.4±41.89ms   1.01  1879.3±40.81ms
CompressedSNARK-StepCircuitSize-22949/Verify               1.00    70.5±3.44ms    1.00    70.3±2.00ms
CompressedSNARK-StepCircuitSize-252325/Prove               1.00    12.7±0.36s     1.01    12.8±0.37s
CompressedSNARK-StepCircuitSize-252325/Verify              1.00   438.9±21.82ms   1.04   455.3±23.93ms
CompressedSNARK-StepCircuitSize-514469/Prove               1.00    24.7±0.63s     1.00    24.6±0.63s
CompressedSNARK-StepCircuitSize-514469/Verify              1.01   701.7±30.84ms   1.00   697.0±31.46ms
CompressedSNARK-StepCircuitSize-55717/Prove                1.01     3.5±0.12s     1.00     3.5±0.11s
CompressedSNARK-StepCircuitSize-55717/Verify               1.00   123.0±4.74ms    1.02   124.9±7.61ms
CompressedSNARK-StepCircuitSize-6565/Prove                 1.00  1147.2±13.65ms   1.03  1180.6±52.56ms
CompressedSNARK-StepCircuitSize-6565/Verify                1.00    46.8±1.97ms    1.01    47.2±1.51ms
RecursiveSNARK-StepCircuitSize-0/Prove                     1.00    68.6±4.89ms    1.02    69.7±12.03ms
RecursiveSNARK-StepCircuitSize-0/Verify                    1.01    28.2±0.79ms    1.00    28.0±0.57ms
RecursiveSNARK-StepCircuitSize-1038757/Prove               1.00  1504.2±82.11ms   1.00  1500.8±37.38ms
RecursiveSNARK-StepCircuitSize-1038757/Verify              1.00  1390.6±33.57ms   1.02  1423.2±81.45ms
RecursiveSNARK-StepCircuitSize-121253/Prove                1.00   224.4±5.93ms    1.04   233.0±11.44ms
RecursiveSNARK-StepCircuitSize-121253/Verify               1.01   177.0±7.56ms    1.00   176.0±9.24ms
RecursiveSNARK-StepCircuitSize-22949/Prove                 1.00   107.8±7.98ms    1.01   108.7±12.36ms
RecursiveSNARK-StepCircuitSize-22949/Verify                1.00    59.1±2.03ms    1.00    59.4±0.88ms
RecursiveSNARK-StepCircuitSize-252325/Prove                1.00   393.3±15.95ms   1.04   407.6±16.87ms
RecursiveSNARK-StepCircuitSize-252325/Verify               1.00   343.9±25.63ms   1.01   347.8±15.34ms
RecursiveSNARK-StepCircuitSize-514469/Prove                1.00   781.9±21.67ms   1.02   793.9±25.06ms
RecursiveSNARK-StepCircuitSize-514469/Verify               1.02   718.6±28.72ms   1.00   706.0±30.34ms
RecursiveSNARK-StepCircuitSize-55717/Prove                 1.00   151.4±7.48ms    1.03   155.5±13.56ms
RecursiveSNARK-StepCircuitSize-55717/Verify                1.00    92.6±2.94ms    1.08   100.3±7.13ms
RecursiveSNARK-StepCircuitSize-6565/Prove                  1.01    77.4±4.27ms    1.00    76.6±4.25ms
RecursiveSNARK-StepCircuitSize-6565/Verify                 1.00    36.3±0.65ms    1.02    37.0±1.56ms
```

h/t @winston-h-zhang 👏

srinathsetty commented 7 months ago

Hi @huitseeker thanks for the PR!

```
RecursiveSNARK-StepCircuitSize-0/Prove        1.00    68.6±4.89ms   1.02    69.7±12.03ms
RecursiveSNARK-StepCircuitSize-6565/Prove     1.01    77.4±4.27ms   1.00    76.6±4.25ms
RecursiveSNARK-StepCircuitSize-22949/Prove    1.00   107.8±7.98ms   1.01   108.7±12.36ms
RecursiveSNARK-StepCircuitSize-55717/Prove    1.00   151.4±7.48ms   1.03   155.5±13.56ms
RecursiveSNARK-StepCircuitSize-121253/Prove   1.00   224.4±5.93ms   1.04   233.0±11.44ms
RecursiveSNARK-StepCircuitSize-252325/Prove   1.00   393.3±15.95ms  1.04   407.6±16.87ms
RecursiveSNARK-StepCircuitSize-514469/Prove   1.00   781.9±21.67ms  1.02   793.9±25.06ms
RecursiveSNARK-StepCircuitSize-1038757/Prove  1.00  1504.2±82.11ms  1.00  1500.8±37.38ms
```

From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

huitseeker commented 7 months ago

> From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

This is correct and expected. This benchmark does not stress the allocator.

This work was originally developed as a bundled package in https://github.com/lurk-lab/arecibo/pull/118 and tested in https://github.com/lurk-lab/lurk-rs/pull/890. What you're seeing here is that work broken down into individual changes. In particular, for https://github.com/lurk-lab/arecibo/pull/139 we have benchmarks at https://github.com/lurk-lab/lurk-rs/pull/923.

srinathsetty commented 6 months ago

> From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

> This is correct and expected. This benchmark does not stress the allocator.

At what circuit size does this PR start to provide benefits? It proposes a fairly large, and arguably somewhat complicated-to-maintain, change to the way memory is managed, so I think it would be good to rigorously understand the benefits it provides. Specifically, for which circuits and which step circuit sizes is this beneficial?

porcuquine commented 6 months ago

> From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

> This is correct and expected. This benchmark does not stress the allocator.

> At what circuit size does this PR start to provide benefits? It proposes a fairly large, and arguably somewhat complicated-to-maintain, change to the way memory is managed, so I think it would be good to rigorously understand the benefits it provides. Specifically, for which circuits and which step circuit sizes is this beneficial?

Let me interpret the Lurk benchmarks shown at https://github.com/lurk-lab/lurk-rs/pull/923#issuecomment-1827137943 (you may need to click the disclosure triangle to see them; I reproduce them below):

| Test | Base | PR | % |
|------|------|----|---|
| LEM Fibonacci Prove - rc = 100/fib/num-100 | 4.8±0.03s | 4.4±0.03s | -8.33% |
| LEM Fibonacci Prove - rc = 100/fib/num-200 | 9.7±0.33s | 9.2±0.08s | -5.15% |
| LEM Fibonacci Prove - rc = 600/fib/num-100 | 3.8±0.01s | N/A | N/A |
| LEM Fibonacci Prove - rc = 600/fib/num-200 | 8.4±0.10s | N/A | N/A |

The rc referenced is "reduction count": the number of inner Lurk circuits bundled into the application circuit. The Lurk circuit is about 12,000 constraints, so rc=100 is about 1.2M constraints. The fibonacci part (num-100, num-200) determines how large the calculation is, and we can derive the number of folding steps from it. In these cases, the calculation uses 17 and 33 steps respectively.
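For concreteness, here is the arithmetic behind those numbers as a small Rust sketch. The constants are approximate, and the frame count below is a hypothetical figure chosen only to reproduce the step counts quoted above:

```rust
fn main() {
    // One inner Lurk circuit is roughly 12,000 constraints.
    let constraints_per_reduction: u64 = 12_000;
    // rc = reduction count: inner circuits bundled per folding step.
    let rc: u64 = 100;

    // Size of the resulting application (step) circuit: ~1.2M constraints.
    let step_circuit_size = constraints_per_reduction * rc;
    assert_eq!(step_circuit_size, 1_200_000);

    // Folding steps = total reduction frames, divided by rc, rounded up.
    // Frame counts are measured from the workload; 1,700 is a hypothetical
    // value that yields the 17 steps mentioned above for num-100.
    let frames: u64 = 1_700;
    let steps = (frames + rc - 1) / rc;
    assert_eq!(steps, 17);
}
```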

So, to synthesize, on Lurk's fibonacci workload, the answer to your question is:

For 1.2M constraints iterated 17 times, we see an 8.33% improvement. For 1.2M constraints iterated 33 times, we see a 5.15% improvement.

Note that these metrics are a bit tricky to interpret because the first step is cheaper and therefore biases toward better performance on shorter computations.

It's possible I presented some of this info wrong, but that is the basic answer. @winston-h-zhang can fill in or fix any details I omitted or got wrong.

huitseeker commented 6 months ago

For more signal on memory, here is the top-level memory profile of the last round of the MinRoot example (65536 iterations/step), obtained with bytehound. The first is on main, the second on the PR.

[screenshot: bytehound memory profile on main]
[screenshot: bytehound memory profile on the PR]

As you can see, the overall memory usage is far from taxing for a modern machine. But it's sufficient to see each of the nine folding steps, each of which creates a short-lived allocation on main (and does not on the PR).

As for how this extrapolates to degraded performance through memory fragmentation in larger proofs, here's what happens if I zoom out of the first picture (same dataset, on main) to see all 7 rounds of MinRoot:

[screenshot: zoomed-out bytehound profile on main, showing all 7 MinRoot rounds]

As you can see, the memory usage pattern scales linearly with the number of constraints. Another way to interpret this is that, for a fixed number of constraints, the performance impact of this PR (through fragmentation) will depend on how memory-pressured the machine is.

winston-h-zhang commented 6 months ago

To elaborate on @porcuquine's and @huitseeker's points, I've put together some more benchmarks that give context for why we've pursued this new memory model, along with a breakdown of the improvement each PR provides.

Context

In theory, many large reallocations of vectors put a lot of load on the memory allocator, due to factors like fragmentation. We should therefore avoid reallocating memory whenever it is possible to reuse already available space (a minimal sketch of this pattern follows the discussion below). Concretely, the following benchmark compares running the proving pipeline under (1) ideal allocator conditions and (2) bad allocator conditions.

| rc | Test | Base (fresh) | Base (load) | % |
|----|------|--------------|-------------|---|
| 100 | Fibonacci(100) | 6.60s | 7.47s | 113% |
| 100 | Fibonacci(200) | 14.58s | 14.89s | 102% |
| 600 | Fibonacci(100) | 4.33s | 6.34s | 146% |
| 600 | Fibonacci(200) | 10.24s | 14.65s | 143% |

In this context, Lurk offers two operational scenarios: "fresh" and "load." In the "fresh" scenario, public parameters are regenerated for each run of the prover, while in the "load" scenario, these parameters are read from a persistent source, like disk storage. The choice between "fresh" and "load" does not semantically alter Lurk's functioning. However, a significant performance regression has been observed, seemingly without a clear cause; a key distinction between the two scenarios is the differing levels of memory pressure they exert.

This issue illustrates a broader potential problem: changes in system variables that do not semantically impact the proving process can still inadvertently trigger regressions. This is particularly problematic given the current instability of the memory pipeline, where even minor alterations might disrupt the allocator and cause regressions in a way that is hard to detect and debug.
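To make the reuse-over-reallocate point concrete, here is a minimal, self-contained Rust sketch (not Nova's actual code; names and sizes are illustrative) contrasting a per-step allocation with a reused scratch buffer:

```rust
// Illustrative sketch only: contrasts two allocation patterns.

fn fold_step_allocating(witness_len: usize) -> Vec<u64> {
    // A fresh, short-lived allocation on every step: this is the
    // pattern that stresses the allocator and promotes fragmentation.
    let mut scratch = vec![0u64; witness_len];
    for (i, slot) in scratch.iter_mut().enumerate() {
        *slot = i as u64; // stand-in for witness computation
    }
    scratch
}

fn fold_step_reusing(scratch: &mut Vec<u64>, witness_len: usize) {
    // Reuse the existing capacity: `clear` keeps the allocation,
    // and `resize` only reallocates if the buffer needs to grow.
    scratch.clear();
    scratch.resize(witness_len, 0);
    for (i, slot) in scratch.iter_mut().enumerate() {
        *slot = i as u64;
    }
}

fn main() {
    // One-shot version: allocates a fresh buffer.
    let first = fold_step_allocating(1 << 20);
    assert_eq!(first.len(), 1 << 20);

    // Reusing version: allocates once, then only writes into capacity.
    let mut scratch = Vec::new();
    for _step in 0..9 {
        fold_step_reusing(&mut scratch, 1 << 20);
    }
    assert_eq!(scratch.len(), 1 << 20);
}
```

The reusing variant allocates once and then writes into existing capacity, which is exactly the behavior that keeps allocator pressure (and fragmentation) flat across folding steps.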

Breakdown of Improvements

The following three benchmarks show how each PR incrementally reduces the regression we observe, until there is no regression left. Note that each benchmark includes all the previous improvements. (A generic sketch of the first technique, removing clones, follows the tables.)

https://github.com/lurk-lab/arecibo/pull/139

| rc | Test | Base (load) | remove r1cs clones | % |
|----|------|-------------|--------------------|---|
| 100 | Fibonacci(100) | 7.47s | 6.60s | 88% |
| 100 | Fibonacci(200) | 14.89s | 13.16s | 88% |
| 600 | Fibonacci(100) | 6.34s | 5.58s | 88% |
| 600 | Fibonacci(200) | 14.65s | 12.36s | 84% |

https://github.com/lurk-lab/arecibo/pull/143

| rc | Test | Base (load) | preallocated witness + previous | % |
|----|------|-------------|---------------------------------|---|
| 100 | Fibonacci(100) | 7.47s | 6.54s | 87% |
| 100 | Fibonacci(200) | 14.89s | 12.90s | 86% |
| 600 | Fibonacci(100) | 6.34s | 5.04s | 79% |
| 600 | Fibonacci(200) | 14.65s | 11.38s | 77% |

https://github.com/lurk-lab/arecibo/pull/144

| rc | Test | Base (load) | large vector allocations + previous | % |
|----|------|-------------|-------------------------------------|---|
| 100 | Fibonacci(100) | 7.47s | 5.71s | 76% |
| 100 | Fibonacci(200) | 14.89s | 11.32s | 76% |
| 600 | Fibonacci(100) | 6.34s | 4.98s | 78% |
| 600 | Fibonacci(200) | 14.65s | 10.50s | 71% |
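For a sense of what "remove r1cs clones" means in practice, here is a generic, hypothetical Rust sketch (the `R1csShape` type below is a stand-in, not Nova's actual struct) of replacing an owning API, which forces a clone per folding step, with a borrowing one:

```rust
// Illustrative only: `R1csShape` is a stand-in for a large structure.
#[derive(Clone)]
struct R1csShape {
    // Sparse matrices: large, and expensive to clone on every step.
    a: Vec<(usize, usize, u64)>,
    b: Vec<(usize, usize, u64)>,
    c: Vec<(usize, usize, u64)>,
}

// Before: takes ownership, forcing callers to clone once per step.
fn fold_owning(shape: R1csShape) -> usize {
    shape.a.len() + shape.b.len() + shape.c.len()
}

// After: borrows, so the same shape is shared across all steps.
fn fold_borrowing(shape: &R1csShape) -> usize {
    shape.a.len() + shape.b.len() + shape.c.len()
}

fn main() {
    let shape = R1csShape { a: vec![(0, 0, 1)], b: vec![], c: vec![] };

    // Owning version: one clone per call.
    let n = fold_owning(shape.clone());
    assert_eq!(n, 1);

    // Borrowing version: no per-step clone of the (potentially huge) matrices.
    let mut total = 0;
    for _step in 0..9 {
        total += fold_borrowing(&shape);
    }
    assert_eq!(total, 9);
}
```

The same shape is then shared across all folding steps, eliminating the repeated large allocations that the cloned matrices would otherwise cause.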

Results

Significant improvements are not anticipated when comparing against our "fresh" scenario, since that scenario operates without memory pressure and therefore never suffers the drawbacks of an architecture struggling to cope with it. However, the significant improvement over our "load" scenario is strong evidence that the memory pipeline is now much more stable.

srinathsetty commented 5 months ago

Thanks for adding all the context! This is a very important optimization in the big picture. We would like to eventually get this feature merged.

One thing we need better clarity on is how this kind of optimization will impact future code changes. For example, in some instances, it inherently requires the caller to allocate and manage memory on behalf of the callee.

Have you considered making this trait-based? For example, in the current PR (assuming I understand the full code correctly), pub struct ResourceBuffer<E: Engine> (defined in lib.rs) seems very much tied to the particulars of nifs.rs. Is there a pattern in which nifs.rs defines certain types that lib.rs can use as a black box (with certain trait methods)?

The reason to bring up this design is that, in the future, we might abstract away nifs.rs behind a trait so that lib.rs knows nothing about its internal details.
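For illustration, here is a minimal sketch of the kind of trait boundary being suggested. Every name below (`FoldingScratch`, `NifsScratch`, `prove_step`) is hypothetical, not Nova's actual API, and the buffer contents are stand-ins for field elements:

```rust
// Hypothetical sketch of a trait boundary between lib.rs and nifs.rs.
// This only illustrates the pattern: nifs.rs owns the buffer type,
// and lib.rs sees it purely as a black box behind a trait.

/// What lib.rs would know about: a reusable scratch space for folding.
pub trait FoldingScratch: Default {
    /// Ensure capacity for a circuit with `num_constraints` rows,
    /// reusing prior allocations where possible.
    fn prepare(&mut self, num_constraints: usize);
    /// Release the held memory if the caller wants to shrink.
    fn reset(&mut self);
}

// nifs.rs would own the concrete type and its internals.
#[derive(Default)]
pub struct NifsScratch {
    witness: Vec<u64>,     // stand-in for field elements
    cross_terms: Vec<u64>, // likewise
}

impl FoldingScratch for NifsScratch {
    fn prepare(&mut self, num_constraints: usize) {
        self.witness.resize(num_constraints, 0);
        self.cross_terms.resize(num_constraints, 0);
    }
    fn reset(&mut self) {
        self.witness = Vec::new();
        self.cross_terms = Vec::new();
    }
}

// lib.rs would then be generic over the trait, never touching fields.
pub fn prove_step<S: FoldingScratch>(scratch: &mut S, n: usize) {
    scratch.prepare(n);
    // ... folding logic would use the prepared buffers ...
}

fn main() {
    let mut scratch = NifsScratch::default();
    prove_step(&mut scratch, 1 << 10);
    scratch.reset();
}
```

Under this pattern, lib.rs depends only on the trait, so nifs.rs could change its internal buffer layout without touching any caller.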