penumbra-zone / webgpu

WebGPU-based Groth16 prover to accelerate client-side proof generation for the Penumbra Protocol

WebGPU baseline benchmarks #3

Closed TalDerei closed 1 year ago

TalDerei commented 1 year ago

Baseline Pippenger's Bucket Method MSM performance for the webgpu-msm reference repository from Demox-Labs.
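For context, the bucket method splits each b-bit scalar into windows of c bits, accumulates points into 2^c − 1 buckets per window, and recombines the buckets with a running-sum plus Horner step. Below is a minimal illustrative sketch (not the webgpu-msm code); to keep it self-contained and runnable, the "group" is plain BigInt addition rather than elliptic-curve point addition.

```typescript
// Sketch of Pippenger's bucket method. In a real MSM, `add`/`double`
// would be curve-point operations; BigInt arithmetic stands in here.
type Point = bigint;
const ZERO: Point = 0n;
const add = (a: Point, b: Point): Point => a + b;
const double = (a: Point): Point => a + a;

// Computes sum_i scalars[i] * points[i] with a window size of c bits.
function msmPippenger(scalars: bigint[], points: Point[], c: number, bits = 256): Point {
  const numWindows = Math.ceil(bits / c);
  const mask = (1n << BigInt(c)) - 1n;
  const windowSums: Point[] = [];

  for (let w = 0; w < numWindows; w++) {
    // One bucket per non-zero digit value in this window.
    const buckets: Point[] = new Array((1 << c) - 1).fill(ZERO);
    for (let i = 0; i < scalars.length; i++) {
      const digit = Number((scalars[i] >> BigInt(w * c)) & mask);
      if (digit !== 0) buckets[digit - 1] = add(buckets[digit - 1], points[i]);
    }
    // Running-sum trick: sum_j (j + 1) * buckets[j] in ~2 * 2^c additions.
    let running = ZERO;
    let acc = ZERO;
    for (let j = buckets.length - 1; j >= 0; j--) {
      running = add(running, buckets[j]);
      acc = add(acc, running);
    }
    windowSums.push(acc);
  }

  // Horner recombination: result = sum_w 2^(w*c) * windowSums[w].
  let result = ZERO;
  for (let w = numWindows - 1; w >= 0; w--) {
    for (let d = 0; d < c; d++) result = double(result);
    result = add(result, windowSums[w]);
  }
  return result;
}

// Example: msmPippenger([3n, 5n], [7n, 11n], 4, 8) === 3n * 7n + 5n * 11n === 76n
```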

TalDerei commented 1 year ago

Apple Mac M1 (2020) with 8 CPU cores and 8 GPU cores on macOS 12 (update: macOS 14 (Sonoma) yields a 25% performance gain). The results are measured in milliseconds. The memory saver and hardware acceleration options were enabled in the Chrome browser. An Nvidia 3090 Ti (Linux, Ubuntu 22.04) was also tested, but WebGPU is highly experimental on Linux builds: it requires an unstable dev browser (Chrome Canary is unsupported on Linux) and extra experimental flags specific to that dev browser. Consequently, the performance was highly unstable.
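As a sanity check before benchmarking, the harness can confirm that the browser actually exposes a usable WebGPU adapter. The sketch below is illustrative, not part of the reference repository, and assumes the `@webgpu/types` ambient declarations for `navigator.gpu`.

```typescript
// Illustrative sketch: verify WebGPU availability before running benchmarks.
async function assertWebGpuAvailable(): Promise<GPUDevice> {
  if (!("gpu" in navigator)) {
    throw new Error("WebGPU not exposed by this browser (navigator.gpu missing)");
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter === null) {
    // On Linux this often means the experimental browser flags are not enabled.
    throw new Error("No suitable GPU adapter available");
  }
  // Resolves to a usable GPUDevice or rejects.
  return adapter.requestDevice();
}
```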

webgpu-msm

Analysis

Setting the number of inputs too low (under 2^10) won't generate meaningful results. The lower the number of inputs, the faster the single-threaded WASM will be, and the larger the performance variance between runs. As the input size increases slightly, the multi-threaded WASM (web workers) outperforms the single-threaded WASM. When the input size grows sufficiently large (2^14), the performance gap between multi-threaded WASM and WebGPU closes, and WebGPU eventually overtakes both the naive and WASM implementations.

At 2^16 constraints, the Pippenger MSM implementation finally outperforms the naive MSM implementation. At this input size, the performance of the single- and multi-threaded WASM is similar and drops off a cliff (~2x slower than the GPU variants). Interestingly, the multi-threaded WASM (using web workers) and the single-threaded WASM perform the same. Preliminary results indicate that on an M2 Max, the multi-threaded variant is about 40% faster than the single-threaded one. Intuitively it should be faster here as well; perhaps there's an issue with the multi-threading in terms of the level at which tasks are delegated. Overall, web workers won't give you the speedup you might expect, since communication between web workers requires message passing, which usually incurs some serialization/deserialization cost.
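For a concrete sense of where that cost comes from, here is a minimal sketch (the `msmWorker.ts` module and the message shape are hypothetical, not the Aleo WASM harness): every `postMessage` either structured-clones its payload into the worker's heap or transfers buffer ownership, and partial results still have to be messaged back and aggregated on the main thread.

```typescript
// Hypothetical worker module; the chunk layout is illustrative only.
const worker = new Worker(new URL("./msmWorker.ts", import.meta.url), { type: "module" });

// Structured clone: the whole chunk is copied into the worker's heap.
function sendCopied(chunk: ArrayBuffer): void {
  worker.postMessage({ kind: "msm-chunk", chunk });
}

// Transferable: the buffer moves instead of being copied, but the sender
// loses access to it and per-message overhead still remains.
function sendTransferred(chunk: ArrayBuffer): void {
  worker.postMessage({ kind: "msm-chunk", chunk }, [chunk]);
}

worker.onmessage = (e: MessageEvent<{ partialSum: Uint8Array }>) => {
  // Partial results come back as messages and are aggregated here.
  console.log("partial MSM result:", e.data.partialSum.byteLength, "bytes");
};
```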

It's worth noting that the input size matters a great deal for the choice between GPU and CPU, as well as for the particular algorithm used. Pippenger in particular has a large overhead, which is amortized over larger inputs, but it's possible Pippenger could be faster in every case as long as the parameters are set correctly. Demox-Labs currently uses the same 2^16 bucket size regardless of input size. It would make sense that choosing a bucket size based on the input would lead to better results, particularly for smaller inputs.
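To illustrate that point, the sketch below picks the window size `c` by minimizing the usual Pippenger cost model of roughly `ceil(bits / c) * (n + 2^(c+1))` point additions, instead of hard-coding a 2^16-bucket window. The cost model and the 256-bit default are assumptions for illustration, not the Demox-Labs heuristic.

```typescript
// Choose a Pippenger window size based on the number of inputs n.
function chooseWindowSize(n: number, bits = 256): number {
  let best = 1;
  let bestCost = Infinity;
  for (let c = 1; c <= 24; c++) {
    // Approximate point additions: one pass per window over n inputs,
    // plus ~2 * 2^c additions to collapse the buckets of each window.
    const cost = Math.ceil(bits / c) * (n + 2 ** (c + 1));
    if (cost < bestCost) {
      bestCost = cost;
      best = c;
    }
  }
  return best;
}

// e.g. chooseWindowSize(2 ** 10) === 7 and chooseWindowSize(2 ** 20) === 16:
// a 2^16-bucket window only pays off once the input is large enough to
// amortize the bucket setup.
```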

Moreover, while the Naive WebGPU MSM implementation scales linearly in runtime as the number of inputs increases, the Pippenger WebGPU MSM does not. Pippenger is also not expected to achieve a near-linear speedup in execution time, i.e., a speedup ratio equal to the number of execution threads available on the GPU.

Benchmarks

Number of inputs: 2^10

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 576 | 560 | 553 |
| Naive WebGPU MSM | 792 | 790 | 782 |
| Aleo Wasm: Single Thread | 606 | 570 | 537 |
| Aleo Wasm: Web Workers | 1142 | 780 | 711 |

Number of inputs: 2^11

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 1096 | 1082 | 1081 |
| Naive WebGPU MSM | 952 | 955 | 942 |
| Aleo Wasm: Single Thread | 1141 | 1078 | 1085 |
| Aleo Wasm: Web Workers | 1239 | 1206 | 1211 |

Number of inputs: 2^12

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 2074 | 1994 | 2028 |
| Naive WebGPU MSM | 1812 | 1837 | 1820 |
| Aleo Wasm: Single Thread | 2177 | 2102 | 2087 |
| Aleo Wasm: Web Workers | 2269 | 2197 | 2189 |

Number of inputs: 2^13

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 4065 | 4005 | 3913 |
| Naive WebGPU MSM | 2966 | 2904 | 2931 |
| Aleo Wasm: Single Thread | 4197 | 4154 | 4123 |
| Aleo Wasm: Web Workers | 4167 | 4132 | 4113 |

Number of inputs: 2^14

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 7567 | 7296 | 7371 |
| Naive WebGPU MSM | 5663 | 5680 | 5654 |
| Aleo Wasm: Single Thread | 8223 | 8216 | 8205 |
| Aleo Wasm: Web Workers | 8222 | 8074 | 8019 |

Number of inputs: 2^15

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 13580 | 14274 | 13880 |
| Naive WebGPU MSM | 10793 | 10779 | 10747 |
| Aleo Wasm: Single Thread | 16290 | 16337 | 16291 |
| Aleo Wasm: Web Workers | 15649 | 15596 | 15538 |

Number of inputs: 2^16

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 24592 | 23538 | 23584 |
| Naive WebGPU MSM | 25438 | 25583 | 25406 |
| Aleo Wasm: Single Thread | 46724 | 46675 | 47138 |
| Aleo Wasm: Web Workers | 46036 | 45542 | 45194 |

Number of inputs: 2^17

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 36926 | 35647 | 34954 |
| Naive WebGPU MSM | 50429 | 50616 | 50652 |
| Aleo Wasm: Single Thread | 93764 | 92801 | 93489 |
| Aleo Wasm: Web Workers | 90028 | 90421 | 90618 |

Number of inputs: 2^18

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 47198 | 46529 | 46272 |
| Naive WebGPU MSM | 83084 | 83600 | 83167 |
| Aleo Wasm: Single Thread | 173216 | - | - |
| Aleo Wasm: Web Workers | Timeout | - | - |

Number of inputs: 2^19

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 63482 | - | - |
| Naive WebGPU MSM | 165990 | - | - |
| Aleo Wasm: Single Thread | 346806 | - | - |
| Aleo Wasm: Web Workers | Timeout | - | - |

Number of inputs: 2^20

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 99528 | - | - |
| Naive WebGPU MSM | 332374 | - | - |
| Aleo Wasm: Single Thread | 695256 | - | - |
| Aleo Wasm: Web Workers | Timeout | - | - |