mratsim / constantine

Constantine: modular, high-performance, zero-dependency cryptography stack for verifiable computation, proof systems and blockchain protocols.

ZKML: Proving Machine Learning #237

Open mratsim opened 1 year ago

mratsim commented 1 year ago

Write-up on proving machine learning models.

mratsim commented 1 year ago

See research on snark efficient NN: https://hackmd.io/ewbgAyx-SfKBHkHUNyp_Og

mratsim commented 1 year ago

Self-contained repo that might be interesting to prove: https://github.com/karpathy/llama2.c

mratsim commented 10 months ago

https://medium.com/@danieldkang/tensorplonk-a-gpu-for-zkml-delivering-1-000x-speedups-d1ab0ad27e1c

TensorPlonk: 1,000x faster proving

To accelerate ML model proving, we built TensorPlonk. TensorPlonk optimizes matrix multiplications, non-linearities, and weight commitments with new optimizations and by extending recent work.

The first part of TensorPlonk accelerates matrix multiplications. We leverage recent work by Eagen & Gabizon called cqlin, which can prove the result of matrix multiplication in O(n) time, if one of the matrices is fixed ahead of time. Luckily, this is the case for many ML models, like the Twitter model.

However, naively using cqlin would require large proofs and excessive work for the verifier. Technically, it would require sending a commitment per matrix multiplication, and the verifier would need to do a pairing check per matrix multiplication. Doing so would increase the proof size and verifier work by nearly 10x compared to our prior work. Large proofs and increased verification times would make ZKML infeasible for blockchain applications.

To solve this problem, we can use a randomized check to verify all of the matrix multiplications within a model at once! Stay tuned for our technical report, which we will release in the coming weeks.
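The "randomized check" idea is in the spirit of Freivalds' algorithm: instead of re-doing each O(n³) matrix product, the verifier tests products against random vectors in O(n²). The sketch below is a toy illustration over a toy prime field in plain Python; names, the modulus, and the round count are illustrative assumptions, not TensorPlonk's actual in-SNARK batched check.

```python
import random

P = 2**61 - 1  # toy prime modulus (assumption; not TensorPlonk's field)

def freivalds_check(A, B, C, rounds=20):
    """Probabilistically check A @ B == C (mod P) in O(n^2) per round."""
    n = len(C[0])
    for _ in range(rounds):
        x = [random.randrange(P) for _ in range(n)]
        # Compute A(Bx) and Cx: two matrix-vector products instead of
        # one matrix-matrix product.
        Bx = [sum(row[j] * x[j] for j in range(n)) % P for row in B]
        ABx = [sum(row[j] * Bx[j] for j in range(len(Bx))) % P for row in A]
        Cx = [sum(row[j] * x[j] for j in range(n)) % P for row in C]
        if ABx != Cx:
            return False  # a wrong product is caught with overwhelming probability
    return True
```

Batching across all the multiplications in a model amortizes this further: one random challenge can bind many product claims into a single check, which is what keeps the proof size and verifier work from growing 10x as described above.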

The second part of TensorPlonk accelerates the non-linearities. To understand why this is needed: existing ZKML work handles non-linearities with a lookup proving argument called plookup, where the proof is split into a "circuit" and "lookup tables." With plookup, the circuit must be at least as large as the table, so using a large table forces a large circuit. In practice, the lookup table may need as many as 2²⁴ elements while the circuit could otherwise be as small as 2¹⁶ elements, meaning an extra ~250x overhead just to use a large lookup table for the non-linearities.

Instead, we can leverage recent work called cq, which allows lookup tables whose size is independent of the circuit size. We further optimized cq to "batch" work across lookup tables, which can reduce the computation for the lookups by up to 3x.
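To make the table sizes concrete: in quantized ZKML, a non-linearity like GELU is precomputed over every representable input and the circuit only proves "this (input, output) pair is in the table." A hedged sketch, where the 16-bit input range, the 8-bit fixed-point `SCALE`, and all names are illustrative assumptions:

```python
import math

SCALE = 2**8  # 8-bit fixed-point scale (assumption)

def gelu(x):
    # Exact GELU via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Precompute quantized GELU over all 2^16 possible 16-bit inputs.
# With plookup the circuit would have to match the table size; with cq
# the circuit size is independent of len(TABLE).
TABLE = {q: round(gelu(q / SCALE) * SCALE) for q in range(-2**15, 2**15)}

def lookup_gelu(q):
    # In a proof, this dictionary access is replaced by a lookup argument.
    return TABLE[q]
```

A 2²⁴-entry table (e.g. for wider activations) is built the same way; the point of cq is that growing `TABLE` no longer grows the circuit.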

Our third optimization targets the weight commitments, which bind the model provider to a fixed set of weights. Instead of a hash, which can be extremely expensive to prove, we use a KZG commitment to the weight vectors used in cqlin. This can reduce the proving time by up to 10x.
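Structurally, a KZG commitment to a weight vector is C = g^(Σᵢ wᵢ·τⁱ), built from a trusted-setup string of powers of a secret τ. A toy sketch of that structure using a small multiplicative group mod a prime — purely illustrative and insecure: real KZG uses pairing-friendly elliptic curves, and τ must be destroyed after setup:

```python
import random

P = 2**127 - 1  # toy group modulus (illustrative, not a secure choice)
G = 3           # toy generator (assumption)
ORDER = P - 1   # exponents live mod the group order

def trusted_setup(max_len):
    tau = random.randrange(1, ORDER)  # "toxic waste": must be discarded
    # SRS: [g, g^tau, g^(tau^2), ...]
    return [pow(G, pow(tau, i, ORDER), P) for i in range(max_len)]

def commit(srs, weights):
    """C = prod_i (g^(tau^i))^(w_i) = g^(sum_i w_i * tau^i)."""
    c = 1
    for base, w in zip(srs, weights):
        c = c * pow(base, w, P) % P
    return c
```

Note the commitment is additively homomorphic in the weights (Commit(u) · Commit(v) = Commit(u + v)), which is part of what makes it cheap to fold into cqlin, unlike a hash.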

mratsim commented 1 month ago

Latest advances in LLMs show that 1-bit LLMs are possible, hence focusing on bit-level optimizations (Binius?) might prove extremely fruitful (pun not intended).


https://arxiv.org/abs/2402.17764
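The linked paper (BitNet b1.58) constrains weights to {-1, 0, 1}, so a matrix-vector product degenerates into additions and subtractions with no multiplications — exactly the kind of structure a binary-field proof system like Binius could exploit. A minimal sketch (function name and layout are illustrative assumptions):

```python
def ternary_matvec(W, x):
    """Matrix-vector product where every entry of W is in {-1, 0, 1}.

    No multiplications: +1 selects x[j], -1 selects -x[j], 0 drops it.
    """
    assert all(w in (-1, 0, 1) for row in W for w in row)
    return [
        sum(x[j] if w == 1 else -x[j] for j, w in enumerate(row) if w != 0)
        for row in W
    ]
```

In a circuit, each such entry costs a selection/negation rather than a full field multiplication, which is where the hoped-for bit-level savings come from.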