At the moment, zkVMs are somewhat too slow for real-time proving on a single computer.
This is especially the case for Risc0, which requires dozens of GPUs.
Unfortunately, on most cloud providers it is significantly easier to scale by adding more machines, and sometimes that is the only way to scale: on AWS, for example, we cannot get more than 8 GPUs per instance.
Hence we need distributed computing support on Raiko.
zkVMs are all in the process of adding continuations (Risc0, Powdr) / sharding (SP1) / chunking (Halo2), so an easy way to distribute compute would be a master prover with a fleet of workers: the master interacts with the end user and delegates work to the workers.
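As a rough sketch of that split (the names and types below are hypothetical, not existing Raiko code), the master/worker protocol can stay agnostic of the zkVM by shipping opaque bytes:

```rust
use serde::{Deserialize, Serialize};

/// One unit of work shipped from the master to a worker. The payload is the
/// serialized segment (Risc0) or shard (SP1) produced by the zkVM's
/// continuation/sharding step, kept as opaque bytes so the wire format does
/// not depend on either SDK's internal types.
#[derive(Serialize, Deserialize)]
struct WorkRequest {
    task_id: u64,
    payload: Vec<u8>,
}

/// The worker's answer: the serialized proof for that segment/shard, which
/// the master aggregates into the final proof before answering the end user.
#[derive(Serialize, Deserialize)]
struct WorkResponse {
    task_id: u64,
    proof: Vec<u8>,
}
```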
GPU Cloud pricing analysis
We need 16 GB or 24 GB of VRAM. The P40 is a 2016 architecture and the V100 a 2019 architecture.
- EC2 P3 instances have V100s, but with only 16 GB and at a higher price, so they are a non-starter.
- P4d instances with A100s (2021) and 80 GB of VRAM are a non-starter due to price.
- EC2 G5 instances with A10Gs fit our needs, but are limited to 8 GPUs per instance.
Risc0
Distributing Risc0 requires sending work to a remote worker, calling prove_segment there, and sending the result back. Note: Risc0 only supports 1 GPU per machine (GPU 0).
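A minimal sketch of the master-side fan-out, assuming the segments have already been produced by the executor; prove_on_worker stands in for the RPC that invokes prove_segment on a worker, and all names are illustrative:

```rust
use std::thread;

/// Stand-in for the RPC that ships one segment to a worker, has the worker
/// run Risc0's prove_segment on its single GPU, and returns the serialized
/// segment receipt.
type ProveOnWorker = fn(worker_idx: usize, segment: Vec<u8>) -> Vec<u8>;

/// Hypothetical master-side fan-out: each segment is shipped to a worker in
/// parallel and the segment receipts are collected in order. Joining the
/// receipts into a single STARK proof is not shown here.
fn prove_segments_on_workers(
    segments: Vec<Vec<u8>>,
    prove_on_worker: ProveOnWorker,
) -> Vec<Vec<u8>> {
    let handles: Vec<_> = segments
        .into_iter()
        .enumerate()
        // One thread per in-flight segment; a real implementation would use a
        // pool bounded by the number of available workers.
        .map(|(worker_idx, segment)| {
            thread::spawn(move || prove_on_worker(worker_idx, segment))
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```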
Once the STARK proof is generated, it needs to be wrapped in a Groth16 proof (done automatically when going through Bonsai). We should be able to use their compact_proof / stark2snark tooling for this.
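A rough sketch of where that wrapping step would sit in a distributed setup; stark_to_snark below is a placeholder for whatever entry point the compact_proof tooling exposes, not a confirmed API:

```rust
use anyhow::Result;

/// Placeholder for Risc0's compact_proof / stark2snark step; the real entry
/// point and its types depend on the pinned Risc0 version, so the receipt and
/// the resulting Groth16 proof are treated as opaque bytes here.
trait SnarkWrapper {
    fn stark_to_snark(&self, stark_receipt: &[u8]) -> Result<Vec<u8>>;
}

/// Final step on the master: once all segment receipts coming back from the
/// workers have been joined into one STARK receipt, wrap it into the Groth16
/// proof that gets posted on-chain.
fn finalize(wrapper: &dyn SnarkWrapper, joined_stark_receipt: &[u8]) -> Result<Vec<u8>> {
    wrapper.stark_to_snark(joined_stark_receipt)
}
```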
SP1
Distributing SP1 requires sending work to a remote worker, calling prove_shard there, and sending the result back.
Note: SP1 supports delegating to a "Succinct Network", with the protocol defined here: https://github.com/succinctlabs/sp1/blob/5db203c/sdk/src/proto/network.rs and the RPC here: https://github.com/succinctlabs/sp1/blob/5db203c55647c30618431822d33f419614f9fab6/sdk/src/lib.rs#L104-L157, but this seems to only delegate work, not split it.
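To make that distinction concrete, a hypothetical dispatch policy could look like the following (illustrative names only, none of this is SP1 API):

```rust
/// Hypothetical dispatch options for SP1 proving in Raiko.
enum Sp1Dispatch {
    /// Hand the entire proof request to the Succinct Network RPC
    /// (delegation only: the network proves the whole program for us).
    DelegateToNetwork { rpc_url: String },
    /// Split the execution into shards ourselves and fan prove_shard calls
    /// out across our own worker fleet.
    SplitAcrossWorkers { worker_urls: Vec<String> },
}
```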
Wrapping the STARK proof in a SNARK is still WIP.
Maintenance considerations
We will likely need:
- prove_shard and prove_segment
- prove_shard_on_worker and prove_segment_on_worker
The maintenance needed to stay in sync with upstream should be minimized.
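One possible way to keep that maintenance surface small (a sketch only; ChunkProver and the impls below are hypothetical, not existing Raiko code) is to put every upstream call behind a single trait, so that a new Risc0 or SP1 release only requires touching one local implementation and one worker RPC implementation:

```rust
use anyhow::Result;

/// Single seam between Raiko and the upstream zkVMs: all calls into Risc0's
/// prove_segment or SP1's prove_shard live behind this trait.
trait ChunkProver {
    /// Prove one segment (Risc0) or shard (SP1), passed as opaque bytes.
    fn prove_chunk(&self, chunk: Vec<u8>) -> Result<Vec<u8>>;
}

/// Runs the upstream prover directly on this machine.
struct LocalProver;

/// Ships the chunk to a remote worker (the prove_segment_on_worker /
/// prove_shard_on_worker path) and returns the proof bytes it sends back.
struct RemoteWorker {
    url: String,
}

impl ChunkProver for LocalProver {
    fn prove_chunk(&self, _chunk: Vec<u8>) -> Result<Vec<u8>> {
        todo!("call the upstream prove_segment / prove_shard here")
    }
}

impl ChunkProver for RemoteWorker {
    fn prove_chunk(&self, _chunk: Vec<u8>) -> Result<Vec<u8>> {
        todo!("ship the chunk to the worker at {} and return the proof bytes", self.url)
    }
}
```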
cc @Champii @petarvujovic98 @CeciliaZ030