microsoft / Nova

Nova: High-speed recursive arguments from folding schemes

Streamline Nova allocations (Arecibo backport) #277

Open huitseeker opened 7 months ago

huitseeker commented 7 months ago

What

This backports the following PRs in Arecibo:

- https://github.com/lurk-lab/arecibo/pull/139
- https://github.com/lurk-lab/arecibo/pull/143
- https://github.com/lurk-lab/arecibo/pull/144

Why

These PRs streamline the memory allocations occurring in performance-sensitive parts of Nova by applying a few simple techniques: removing per-step R1CS clones, preallocating witness buffers, and reusing large vector allocations (each detailed in the breakdown further down the thread).

Outcome

Each PR (in Arecibo; each corresponds to a commit here) has been benchmarked on Lurk-size circuits to show its impact on large witnesses / instances, and each shows a 5-8% improvement in proving latency.

Local benchmarks (small circuits, no memory pressure, high variance)

critcmp results (meant to demonstrate this does not make things worse from a CPU PoV):

```
group                                                      main                   nova_allocations
-----                                                      ----                   ----------------
CompressedSNARK-Commitments-StepCircuitSize-0/Prove        1.00     7.3±0.08s     1.00     7.3±0.12s
CompressedSNARK-Commitments-StepCircuitSize-0/Verify       1.00   188.3±5.29ms    1.02   191.5±4.17ms
CompressedSNARK-Commitments-StepCircuitSize-121253/Prove   1.01    23.3±0.58s     1.00    23.1±0.20s
CompressedSNARK-Commitments-StepCircuitSize-121253/Verify  1.05   728.1±60.72ms   1.00   694.4±20.50ms
CompressedSNARK-Commitments-StepCircuitSize-22949/Prove    1.00    11.6±0.16s     1.00    11.6±0.16s
CompressedSNARK-Commitments-StepCircuitSize-22949/Verify   1.07   333.4±31.70ms   1.00   312.0±10.16ms
CompressedSNARK-Commitments-StepCircuitSize-252325/Prove   1.00    43.4±0.87s     1.01    43.9±0.78s
CompressedSNARK-Commitments-StepCircuitSize-252325/Verify  1.03  1213.0±107.21ms  1.00  1173.6±48.71ms
CompressedSNARK-Commitments-StepCircuitSize-55717/Prove    1.00    19.7±0.35s     1.00    19.8±0.43s
CompressedSNARK-Commitments-StepCircuitSize-55717/Verify   1.00   599.4±29.04ms   1.00   599.9±28.39ms
CompressedSNARK-Commitments-StepCircuitSize-6565/Prove     1.00    10.8±0.13s     1.00    10.7±0.09s
CompressedSNARK-Commitments-StepCircuitSize-6565/Verify    1.04   302.6±10.64ms   1.00   291.6±10.66ms
CompressedSNARK-StepCircuitSize-0/Prove                    1.01   799.1±26.03ms   1.00   791.7±10.32ms
CompressedSNARK-StepCircuitSize-0/Verify                   1.02    31.3±1.15ms    1.00    30.7±0.61ms
CompressedSNARK-StepCircuitSize-1038757/Prove              1.00    48.6±0.76s     1.00    48.4±1.05s
CompressedSNARK-StepCircuitSize-1038757/Verify             1.00  1599.8±38.46ms   1.01  1622.6±36.11ms
CompressedSNARK-StepCircuitSize-121253/Prove               1.00     6.5±0.27s     1.01     6.6±0.15s
CompressedSNARK-StepCircuitSize-121253/Verify              1.02   210.2±12.48ms   1.00   206.0±12.69ms
CompressedSNARK-StepCircuitSize-22949/Prove                1.00  1865.4±41.89ms   1.01  1879.3±40.81ms
CompressedSNARK-StepCircuitSize-22949/Verify               1.00    70.5±3.44ms    1.00    70.3±2.00ms
CompressedSNARK-StepCircuitSize-252325/Prove               1.00    12.7±0.36s     1.01    12.8±0.37s
CompressedSNARK-StepCircuitSize-252325/Verify              1.00   438.9±21.82ms   1.04   455.3±23.93ms
CompressedSNARK-StepCircuitSize-514469/Prove               1.00    24.7±0.63s     1.00    24.6±0.63s
CompressedSNARK-StepCircuitSize-514469/Verify              1.01   701.7±30.84ms   1.00   697.0±31.46ms
CompressedSNARK-StepCircuitSize-55717/Prove                1.01     3.5±0.12s     1.00     3.5±0.11s
CompressedSNARK-StepCircuitSize-55717/Verify               1.00   123.0±4.74ms    1.02   124.9±7.61ms
CompressedSNARK-StepCircuitSize-6565/Prove                 1.00  1147.2±13.65ms   1.03  1180.6±52.56ms
CompressedSNARK-StepCircuitSize-6565/Verify                1.00    46.8±1.97ms    1.01    47.2±1.51ms
RecursiveSNARK-StepCircuitSize-0/Prove                     1.00    68.6±4.89ms    1.02    69.7±12.03ms
RecursiveSNARK-StepCircuitSize-0/Verify                    1.01    28.2±0.79ms    1.00    28.0±0.57ms
RecursiveSNARK-StepCircuitSize-1038757/Prove               1.00  1504.2±82.11ms   1.00  1500.8±37.38ms
RecursiveSNARK-StepCircuitSize-1038757/Verify              1.00  1390.6±33.57ms   1.02  1423.2±81.45ms
RecursiveSNARK-StepCircuitSize-121253/Prove                1.00   224.4±5.93ms    1.04   233.0±11.44ms
RecursiveSNARK-StepCircuitSize-121253/Verify               1.01   177.0±7.56ms    1.00   176.0±9.24ms
RecursiveSNARK-StepCircuitSize-22949/Prove                 1.00   107.8±7.98ms    1.01   108.7±12.36ms
RecursiveSNARK-StepCircuitSize-22949/Verify                1.00    59.1±2.03ms    1.00    59.4±0.88ms
RecursiveSNARK-StepCircuitSize-252325/Prove                1.00   393.3±15.95ms   1.04   407.6±16.87ms
RecursiveSNARK-StepCircuitSize-252325/Verify               1.00   343.9±25.63ms   1.01   347.8±15.34ms
RecursiveSNARK-StepCircuitSize-514469/Prove                1.00   781.9±21.67ms   1.02   793.9±25.06ms
RecursiveSNARK-StepCircuitSize-514469/Verify               1.02   718.6±28.72ms   1.00   706.0±30.34ms
RecursiveSNARK-StepCircuitSize-55717/Prove                 1.00   151.4±7.48ms    1.03   155.5±13.56ms
RecursiveSNARK-StepCircuitSize-55717/Verify                1.00    92.6±2.94ms    1.08   100.3±7.13ms
RecursiveSNARK-StepCircuitSize-6565/Prove                  1.01    77.4±4.27ms    1.00    76.6±4.25ms
RecursiveSNARK-StepCircuitSize-6565/Verify                 1.00    36.3±0.65ms    1.02    37.0±1.56ms
```

h/t @winston-h-zhang 👏

srinathsetty commented 7 months ago

Hi @huitseeker thanks for the PR!

```
RecursiveSNARK-StepCircuitSize-0/Prove        1.00    68.6±4.89ms   1.02    69.7±12.03ms
RecursiveSNARK-StepCircuitSize-6565/Prove     1.01    77.4±4.27ms   1.00    76.6±4.25ms
RecursiveSNARK-StepCircuitSize-22949/Prove    1.00   107.8±7.98ms   1.01   108.7±12.36ms
RecursiveSNARK-StepCircuitSize-55717/Prove    1.00   151.4±7.48ms   1.03   155.5±13.56ms
RecursiveSNARK-StepCircuitSize-121253/Prove   1.00   224.4±5.93ms   1.04   233.0±11.44ms
RecursiveSNARK-StepCircuitSize-252325/Prove   1.00   393.3±15.95ms  1.04   407.6±16.87ms
RecursiveSNARK-StepCircuitSize-514469/Prove   1.00   781.9±21.67ms  1.02   793.9±25.06ms
RecursiveSNARK-StepCircuitSize-1038757/Prove  1.00  1504.2±82.11ms  1.00  1500.8±37.38ms
```

From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

huitseeker commented 7 months ago

> From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

This is correct and expected. This benchmark does not stress the allocator.

This work was originally developed as a bundled package in https://github.com/lurk-lab/arecibo/pull/118 and tested in https://github.com/lurk-lab/lurk-rs/pull/890. What you're seeing here is that work broken down into individual changes. In particular, for https://github.com/lurk-lab/arecibo/pull/139 we have benchmarks at https://github.com/lurk-lab/lurk-rs/pull/923.

srinathsetty commented 6 months ago

> From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

> This is correct and expected. This benchmark does not stress the allocator.

At what circuit size does this PR start to provide benefits? It proposes a fairly large, and arguably somewhat complicated-to-maintain, change to the way memory is managed, so I think it would be good to rigorously understand the benefits it provides. Specifically, for which circuits and which step circuit sizes is this beneficial?

porcuquine commented 6 months ago

> From these lines, it seems like these preallocations only save 4 milliseconds for the prover when the step circuit has about a million gates. This is only a 0.26% improvement. Am I reading this correctly?

> This is correct and expected. This benchmark does not stress the allocator.

> At what circuit size does this PR start to provide benefits? It proposes a fairly large, and arguably somewhat complicated-to-maintain, change to the way memory is managed, so I think it would be good to rigorously understand the benefits it provides. Specifically, for which circuits and which step circuit sizes is this beneficial?

Let me interpret the Lurk benchmarks shown at https://github.com/lurk-lab/lurk-rs/pull/923#issuecomment-1827137943 (you may need to click the disclosure triangle to see them; I reproduce them below):

| Test | Base | PR | % |
|------|------|----|---|
| LEM Fibonacci Prove - rc = 100/fib/num-100 | 4.8±0.03s | 4.4±0.03s | -8.33% |
| LEM Fibonacci Prove - rc = 100/fib/num-200 | 9.7±0.33s | 9.2±0.08s | -5.15% |
| LEM Fibonacci Prove - rc = 600/fib/num-100 | 3.8±0.01s | N/A | N/A |
| LEM Fibonacci Prove - rc = 600/fib/num-200 | 8.4±0.10s | N/A | N/A |

The rc referenced is "reduction count": the number of inner Lurk circuits bundled into the application circuit. The Lurk circuit is about 12,000 constraints, so rc=100 is about 1.2M constraints. The fibonacci part (num-100, num-200) determines how large the calculation is, and we can derive the number of folding steps from it. In these cases, the calculation uses 17 and 33 steps respectively.
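For concreteness, here is the arithmetic behind those numbers as a small Rust sketch. The constants are approximate, and the frame count below is a hypothetical figure chosen only to reproduce the step counts quoted above:

```rust
fn main() {
    // One inner Lurk circuit is roughly 12,000 constraints.
    let constraints_per_reduction: u64 = 12_000;
    // rc = reduction count: inner circuits bundled per folding step.
    let rc: u64 = 100;

    // Size of the resulting application (step) circuit: ~1.2M constraints.
    let step_circuit_size = constraints_per_reduction * rc;
    assert_eq!(step_circuit_size, 1_200_000);

    // Folding steps = total reduction frames, divided by rc, rounded up.
    // Frame counts are measured from the workload; 1,700 is a hypothetical
    // value that yields the 17 steps mentioned above for num-100.
    let frames: u64 = 1_700;
    let steps = (frames + rc - 1) / rc;
    assert_eq!(steps, 17);
}
```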

So, to synthesize, on Lurk's fibonacci workload, the answer to your question is:

For 1.2M constraints iterated 17 times, we see an 8.33% improvement. For 1.2M constraints iterated 33 times, we see a 5.15% improvement.

Note that these metrics are a bit tricky to interpret because the first step is cheaper and therefore biases toward better performance on shorter computations.

It's possible I presented some of this info wrong, but that is the basic answer. @winston-h-zhang can fill in or fix any details I omitted or got wrong.

huitseeker commented 6 months ago

For more signal on memory, here is the top-level memory profile of the last round of the MinRoot example (65536 iterations/step), obtained with bytehound. The first is on main, the second on the PR.

[screenshot: bytehound memory profile on main]
[screenshot: bytehound memory profile on the PR]

As you can see, the overall memory usage is far from taxing for a modern machine. But it's sufficient to see each of the nine folding steps, each of which creates a short-lived allocation on main (and does not on the PR).

As for how this extrapolates to degraded performance through memory fragmentation in larger proofs, here's what happens if I zoom out of the first picture (same dataset, on main) to see all 7 rounds of MinRoot:

[screenshot: zoomed-out bytehound profile on main, showing all 7 MinRoot rounds]

As you can see, the memory usage pattern scales linearly with the number of constraints. Another way to interpret this is that, for a fixed number of constraints, the performance impact of this PR (through fragmentation) will depend on how memory-pressured the machine is.

winston-h-zhang commented 6 months ago

To elaborate on @porcuquine's and @huitseeker's points, I've put together some more benchmarks that give context for why we've pursued this new memory model, along with a breakdown of the improvement each PR provides.

Context

In theory, many large reallocations of vectors put a lot of load on the memory allocator, due to factors like fragmentation. We should therefore avoid reallocating memory whenever it is possible to reuse already available space (a minimal sketch of this pattern follows the discussion below). Concretely, the following benchmark compares running the proving pipeline under (1) ideal allocator conditions and (2) bad allocator conditions.

| rc | Test | Base (fresh) | Base (load) | % |
|----|------|--------------|-------------|---|
| 100 | Fibonacci(100) | 6.60s | 7.47s | 113% |
| 100 | Fibonacci(200) | 14.58s | 14.89s | 102% |
| 600 | Fibonacci(100) | 4.33s | 6.34s | 146% |
| 600 | Fibonacci(200) | 10.24s | 14.65s | 143% |

In this context, Lurk offers two operational scenarios: "fresh" and "load." In the "fresh" scenario, public parameters are regenerated for each run of the prover, while in the "load" scenario, these parameters are read from a persistent source, like disk storage. The choice between "fresh" and "load" does not semantically alter Lurk's functioning. However, a significant performance regression has been observed, seemingly without a clear cause; a key distinction between the two scenarios is the differing levels of memory pressure they exert.

This issue illustrates a broader potential problem: changes in system variables that do not semantically impact the proving process can still inadvertently trigger regressions. This is particularly problematic given the current instability of the memory pipeline, where even minor alterations might disrupt the allocator and cause regressions in a way that is hard to detect and debug.
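To make the reuse-over-reallocate point concrete, here is a minimal, self-contained Rust sketch (not Nova's actual code; names and sizes are illustrative) contrasting a per-step allocation with a reused scratch buffer:

```rust
// Illustrative sketch only: contrasts two allocation patterns.

fn fold_step_allocating(witness_len: usize) -> Vec<u64> {
    // A fresh, short-lived allocation on every step: this is the
    // pattern that stresses the allocator and promotes fragmentation.
    let mut scratch = vec![0u64; witness_len];
    for (i, slot) in scratch.iter_mut().enumerate() {
        *slot = i as u64; // stand-in for witness computation
    }
    scratch
}

fn fold_step_reusing(scratch: &mut Vec<u64>, witness_len: usize) {
    // Reuse the existing capacity: `clear` keeps the allocation,
    // and `resize` only reallocates if the buffer needs to grow.
    scratch.clear();
    scratch.resize(witness_len, 0);
    for (i, slot) in scratch.iter_mut().enumerate() {
        *slot = i as u64;
    }
}

fn main() {
    // One-shot version: allocates a fresh buffer.
    let first = fold_step_allocating(1 << 20);
    assert_eq!(first.len(), 1 << 20);

    // Reusing version: allocates once, then only writes into capacity.
    let mut scratch = Vec::new();
    for _step in 0..9 {
        fold_step_reusing(&mut scratch, 1 << 20);
    }
    assert_eq!(scratch.len(), 1 << 20);
}
```

The reusing variant allocates once and then writes into existing capacity, which is exactly the behavior that keeps allocator pressure (and fragmentation) flat across folding steps.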

Breakdown of Improvements

The following three benchmarks show how each PR incrementally reduces the regression we observe, until there is no regression left. Note that each benchmark includes all the previous improvements. (A generic sketch of the first technique, removing clones, follows the tables.)

https://github.com/lurk-lab/arecibo/pull/139

| rc | Test | Base (load) | remove r1cs clones | % |
|----|------|-------------|--------------------|---|
| 100 | Fibonacci(100) | 7.47s | 6.60s | 88% |
| 100 | Fibonacci(200) | 14.89s | 13.16s | 88% |
| 600 | Fibonacci(100) | 6.34s | 5.58s | 88% |
| 600 | Fibonacci(200) | 14.65s | 12.36s | 84% |

https://github.com/lurk-lab/arecibo/pull/143

| rc | Test | Base (load) | preallocated witness + previous | % |
|----|------|-------------|---------------------------------|---|
| 100 | Fibonacci(100) | 7.47s | 6.54s | 87% |
| 100 | Fibonacci(200) | 14.89s | 12.90s | 86% |
| 600 | Fibonacci(100) | 6.34s | 5.04s | 79% |
| 600 | Fibonacci(200) | 14.65s | 11.38s | 77% |

https://github.com/lurk-lab/arecibo/pull/144

| rc | Test | Base (load) | large vector allocations + previous | % |
|----|------|-------------|-------------------------------------|---|
| 100 | Fibonacci(100) | 7.47s | 5.71s | 76% |
| 100 | Fibonacci(200) | 14.89s | 11.32s | 76% |
| 600 | Fibonacci(100) | 6.34s | 4.98s | 78% |
| 600 | Fibonacci(200) | 14.65s | 10.50s | 71% |
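For a sense of what "remove r1cs clones" means in practice, here is a generic, hypothetical Rust sketch (the `R1csShape` type below is a stand-in, not Nova's actual struct) of replacing an owning API, which forces a clone per folding step, with a borrowing one:

```rust
// Illustrative only: `R1csShape` is a stand-in for a large structure.
#[derive(Clone)]
struct R1csShape {
    // Sparse matrices: large, and expensive to clone on every step.
    a: Vec<(usize, usize, u64)>,
    b: Vec<(usize, usize, u64)>,
    c: Vec<(usize, usize, u64)>,
}

// Before: takes ownership, forcing callers to clone once per step.
fn fold_owning(shape: R1csShape) -> usize {
    shape.a.len() + shape.b.len() + shape.c.len()
}

// After: borrows, so the same shape is shared across all steps.
fn fold_borrowing(shape: &R1csShape) -> usize {
    shape.a.len() + shape.b.len() + shape.c.len()
}

fn main() {
    let shape = R1csShape { a: vec![(0, 0, 1)], b: vec![], c: vec![] };

    // Owning version: one clone per call.
    let n = fold_owning(shape.clone());
    assert_eq!(n, 1);

    // Borrowing version: no per-step clone of the (potentially huge) matrices.
    let mut total = 0;
    for _step in 0..9 {
        total += fold_borrowing(&shape);
    }
    assert_eq!(total, 9);
}
```

The same shape is then shared across all folding steps, eliminating the repeated large allocations that the cloned matrices would otherwise cause.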

Results

Significant improvements are not anticipated when comparing against our "fresh" scenario, since that scenario operates without memory pressure and therefore never suffers the drawbacks of an architecture struggling to cope with it. However, the significant improvement over our "load" scenario is strong evidence that the memory pipeline is now much more stable.

srinathsetty commented 5 months ago

Thanks for adding all the context! This is a very important optimization in the big picture. We would like to eventually get this feature merged.

One thing we need better clarity on is how this kind of optimization will impact future code changes. For example, in some instances, it inherently requires the caller to allocate and manage memory on behalf of the callee.

Have you considered making this trait-based? For example, in the current PR (assuming I understand the full code correctly), pub struct ResourceBuffer<E: Engine> (defined in lib.rs) seems very much tied to the particulars of nifs.rs. Is there a pattern in which nifs.rs defines certain types that lib.rs can use as a black box (with certain trait methods)?

The reason to bring up this design is that, in the future, we might abstract away nifs.rs behind a trait so that lib.rs knows nothing about its internal details.
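For illustration, here is a minimal sketch of the kind of trait boundary being suggested. Every name below (`FoldingScratch`, `NifsScratch`, `prove_step`) is hypothetical, not Nova's actual API, and the buffer contents are stand-ins for field elements:

```rust
// Hypothetical sketch of a trait boundary between lib.rs and nifs.rs.
// This only illustrates the pattern: nifs.rs owns the buffer type,
// and lib.rs sees it purely as a black box behind a trait.

/// What lib.rs would know about: a reusable scratch space for folding.
pub trait FoldingScratch: Default {
    /// Ensure capacity for a circuit with `num_constraints` rows,
    /// reusing prior allocations where possible.
    fn prepare(&mut self, num_constraints: usize);
    /// Release the held memory if the caller wants to shrink.
    fn reset(&mut self);
}

// nifs.rs would own the concrete type and its internals.
#[derive(Default)]
pub struct NifsScratch {
    witness: Vec<u64>,     // stand-in for field elements
    cross_terms: Vec<u64>, // likewise
}

impl FoldingScratch for NifsScratch {
    fn prepare(&mut self, num_constraints: usize) {
        self.witness.resize(num_constraints, 0);
        self.cross_terms.resize(num_constraints, 0);
    }
    fn reset(&mut self) {
        self.witness = Vec::new();
        self.cross_terms = Vec::new();
    }
}

// lib.rs would then be generic over the trait, never touching fields.
pub fn prove_step<S: FoldingScratch>(scratch: &mut S, n: usize) {
    scratch.prepare(n);
    // ... folding logic would use the prepared buffers ...
}

fn main() {
    let mut scratch = NifsScratch::default();
    prove_step(&mut scratch, 1 << 10);
    scratch.reset();
}
```

Under this pattern, lib.rs depends only on the trait, so nifs.rs could change its internal buffer layout without touching any caller.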