Lurk is a Turing-complete programming language for recursive zk-SNARKs. It is a statically scoped dialect of Lisp, influenced by Scheme and Common Lisp.
Benchmark Organization in 5 Points
Need for Diverse Benchmarking Infrastructure: We need a comprehensive benchmarking infrastructure that measures various aspects across different benchmarks, with an emphasis on developing shared vocabulary and terminology.
Next Steps: Develop and implement a diverse set of benchmarks and agree on common terminology for discussion.
Static Analysis of Performance: It’s highlighted that certain performance characteristics, like Lurk reduction algorithm efficiency, can be understood 'statically' based on code changes and don’t require testing on powerful machines.
Next Steps: Continue static analysis for algorithmic improvements and integrate findings into tests.
Real-world Performance Measurement: There’s a need to measure real-world performance of the proving pipeline, including aspects like proof time, throughput, hardware requirements, and parameter size.
Next Steps: Establish benchmarking infrastructure for real-world performance metrics and periodically validate these metrics.
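As a sketch of what one "real-world" measurement record might contain, here is a minimal std-only Rust type; the struct and field names are illustrative assumptions, not an existing Lurk API:

```rust
use std::time::Duration;

/// Illustrative record of one end-to-end proving run. The name
/// `ProvingRunMetrics` and its fields are hypothetical, chosen to
/// mirror the metrics listed above (proof time, throughput inputs,
/// hardware footprint, parameter size).
#[derive(Debug)]
struct ProvingRunMetrics {
    reduction_count: u64,   // number of Lurk reductions proved
    proof_time: Duration,   // wall-clock time for the whole pipeline
    peak_memory_bytes: u64, // observed peak memory use
    parameter_bytes: u64,   // size of public parameters on disk
}

impl ProvingRunMetrics {
    /// Throughput in reductions per second, a natural headline number
    /// for comparing runs on the same hardware.
    fn reductions_per_second(&self) -> f64 {
        self.reduction_count as f64 / self.proof_time.as_secs_f64()
    }
}

fn main() {
    let run = ProvingRunMetrics {
        reduction_count: 10_000,
        proof_time: Duration::from_secs(50),
        peak_memory_bytes: 8 << 30,
        parameter_bytes: 1 << 30,
    };
    println!("{:.1} reductions/s", run.reductions_per_second());
}
```

Periodic validation would then amount to re-running the pipeline and comparing fresh records against stored ones.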
Non-regression Testing: The importance of non-regression testing is emphasized to prevent unexpected performance issues due to changes in code or dependencies.
Next Steps: Implement non-regression tests to block merging if negative performance impact is detected. #790
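A non-regression gate can be as simple as comparing a fresh measurement against a stored baseline with a noise tolerance, failing CI (and thus blocking the merge) when the tolerance is exceeded. A minimal sketch, with a hypothetical function name:

```rust
/// Returns true if `current_secs` regresses more than `tolerance`
/// (e.g. 0.05 for 5%) relative to `baseline_secs`. A CI job would
/// fail the build, blocking the merge, when this returns true.
fn is_regression(baseline_secs: f64, current_secs: f64, tolerance: f64) -> bool {
    current_secs > baseline_secs * (1.0 + tolerance)
}

fn main() {
    // With a 5% tolerance: 10.4s against a 10.0s baseline is within noise...
    assert!(!is_regression(10.0, 10.4, 0.05));
    // ...but 11.0s is a real regression and should block the merge.
    assert!(is_regression(10.0, 11.0, 0.05));
    println!("regression gate ok");
}
```

The tolerance matters in practice: too tight and benchmark noise blocks unrelated PRs, too loose and small regressions accumulate unnoticed.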
FoldingConfig and NIVC Considerations: The introduction of NIVC necessitates considering performance variations and benchmarking different FoldingConfigs.
Next Steps: Experimentally determine the optimal FoldingConfig and develop benchmarks to measure their performance.
Older Brainstorm: What to benchmark?
We need some good benchmarks exposing performance of many parts of the system so we can:
optimize individual things we know are important (and measure)
discover how best to use the system as it is based on observed performance (what are good vs bad patterns)
track changes over time and avoid accidental regressions, etc.
It would be good to have a general end-to-end benchmark that starts with Lurk source code (text), and ends with a successfully verified proof (first milestone: #378). We should then also have a corpus of 'interesting' input programs. This issue is about what the standard benchmark we run on each such program should look like:
We should make sure to measure each phase that happens with every proof:
Creation of the store
Parsing/reading of the source into the store
Hydration of the ScalarCache. Note: the split between this and the previous phase depends on the program, since some programs need to (or may, even if they don't need to) hash more eagerly.
Evaluation
Circuit synthesis
Witness generation (currently largely conflated with circuit synthesis)
Other per-proof and per-reduction (evaluation/proof) steps, like constant ptr and global allocation creation
Folding (more generally, the incremental per-circuit cost — which is 'folding' when using Nova but is proving of a single step circuit when using, e.g. Groth16).
Final SNARK (which is 'aggregation' with Groth16/SnarkPack+)
Verification
any parameter loading/generation for folding, final SNARK, and verification
saving of the store [not currently supported] — or at least those parts of it which are ephemeral and can't be regenerated (i.e. private data, new secrets and their relationship to commitments, etc.)
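The per-proof phases above suggest a simple instrumentation pattern: wrap each phase in a timer and accumulate a labeled profile for the run. A minimal std-only sketch (the `PhaseProfile` type is hypothetical, not an existing API; the closures stand in for the real phases):

```rust
use std::time::{Duration, Instant};

/// Accumulates (phase name, duration) pairs for one end-to-end run.
/// Hypothetical helper, not part of the Lurk codebase.
struct PhaseProfile {
    phases: Vec<(&'static str, Duration)>,
}

impl PhaseProfile {
    fn new() -> Self {
        Self { phases: Vec::new() }
    }

    /// Run `f`, record its wall-clock duration under `name`, and
    /// return its result so instrumentation doesn't change the code shape.
    fn time<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.phases.push((name, start.elapsed()));
        out
    }

    /// Total time across all recorded phases.
    fn total(&self) -> Duration {
        self.phases.iter().map(|(_, d)| *d).sum()
    }
}

fn main() {
    let mut profile = PhaseProfile::new();
    // Stand-ins for the real phases (parsing, evaluation, synthesis, ...).
    let parsed = profile.time("parse", || "(+ 1 2)".len());
    let _evaluated = profile.time("evaluate", || parsed * 2);
    for (name, d) in &profile.phases {
        println!("{name}: {d:?}");
    }
    println!("total: {:?}", profile.total());
}
```

Keeping each phase behind one `time` call makes it easy to report the per-phase breakdown alongside the end-to-end total.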
We should also measure the parts that can be factored out:
public parameter generation
prover key generation
verifier key generation
We should measure a range of parameters, though not necessarily every combination of all parameters, as the number grows. These should at least include:
Number of frames per step circuit (chunk_frame_count).
Backend
Final SNARK (currently one option for Nova, but there will be choices in the future)
Zero-knowledge yes/no (not a current option, but will be)
[many more in the future] — noted here so we can design for that.
GPU [none | CUDA | OpenCL]: how many, what specs?
CPU — how many cores, how much RAM, etc.
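Since the full cartesian product of these parameters grows too fast to run every combination, the grid can be enumerated explicitly and pruned. A sketch of the configuration space (the type, its variants, and the pruning rule are all illustrative assumptions, not existing Lurk types or policy):

```rust
/// Illustrative GPU choice, mirroring the list above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Gpu {
    None,
    Cuda,
    OpenCl,
}

/// Illustrative benchmark configuration; not an existing Lurk type.
#[derive(Debug, Clone)]
struct BenchConfig {
    chunk_frame_count: usize,
    zero_knowledge: bool, // not a current option, but will be
    gpu: Gpu,
}

/// Enumerate a pruned grid: every chunk size on CPU, but GPU variants
/// only at the largest chunk size, keeping the matrix small as the
/// number of parameters grows. The pruning rule here is arbitrary,
/// chosen just to illustrate the idea.
fn configs() -> Vec<BenchConfig> {
    let chunk_sizes = [1, 10, 100];
    let mut out = Vec::new();
    for &chunk_frame_count in &chunk_sizes {
        out.push(BenchConfig { chunk_frame_count, zero_knowledge: false, gpu: Gpu::None });
    }
    for gpu in [Gpu::Cuda, Gpu::OpenCl] {
        out.push(BenchConfig { chunk_frame_count: 100, zero_knowledge: false, gpu });
    }
    out
}

fn main() {
    for config in configs() {
        println!("{config:?}");
    }
}
```

Making the grid a plain function of data also means the benchmark runner and the report generator can share one definition of what was measured.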
Things we should measure for the above:
time
memory usage
cpu usage (total, but also allocation: are we making good use of multiple cores?)
data size (concrete size of things like cached parameters, proofs, initial store, final store, serialized/disk stores, etc.)
number of hashes (per arity)
more fine-grained hashing (by type, etc.)
other fine-grained events in the store
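Counting hashes per arity (and other fine-grained store events) comes down to a counter keyed by arity, incremented at each hash call site. A std-only sketch; the `HashCounters` type is a toy stand-in, not the real Lurk store instrumentation:

```rust
use std::collections::HashMap;

/// Toy event counter for store instrumentation: tallies how many
/// hashes of each arity were performed during a run. Hypothetical
/// helper, not the real Lurk store.
#[derive(Default)]
struct HashCounters {
    by_arity: HashMap<usize, u64>,
}

impl HashCounters {
    /// Record one hash of `arity` inputs; the real store would call
    /// this at each hashing site.
    fn record(&mut self, arity: usize) {
        *self.by_arity.entry(arity).or_insert(0) += 1;
    }

    /// Number of hashes of the given arity seen so far.
    fn count(&self, arity: usize) -> u64 {
        self.by_arity.get(&arity).copied().unwrap_or(0)
    }
}

fn main() {
    let mut counters = HashCounters::default();
    // Simulate a run dominated by 4-ary hashes with a few 8-ary ones.
    for _ in 0..100 {
        counters.record(4);
    }
    for _ in 0..7 {
        counters.record(8);
    }
    println!("4-ary: {}, 8-ary: {}", counters.count(4), counters.count(8));
}
```

The same keyed-counter pattern extends to hashing by type and to other store events: swap the key from an arity to whatever event label is of interest.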
[x] #378