Closed TalDerei closed 1 year ago
Benchmarks were run on an Apple Mac M1 (2020) with 8 CPU cores and 8 GPU cores on macOS 12 (update: macOS 14 (Sonoma) yields a 25% performance gain). All results are measured in milliseconds. The memory saver and hardware acceleration options were enabled in the Chrome browser. An NVIDIA 3090 Ti (Ubuntu 22.04) was also tested, but WebGPU is highly experimental on Linux builds: it requires an unstable dev browser (since Chrome Canary is not supported on Linux) and extra experimental flags specific to that browser. Consequently, performance there was highly unstable.
Setting the number of inputs too low (under 2^10) won’t generate meaningful results. The lower the number of inputs, the faster single-threaded WASM will be, and the larger the performance variance between runs. As the input size increases slightly, multi-threaded WASM (web workers) outperforms single-threaded WASM. When the input size grows sufficiently large (2^14), the performance gap between multi-threaded WASM and WebGPU closes, and WebGPU eventually overtakes both the naive and WASM implementations.
At 2^16 inputs, the Pippenger MSM implementation finally outperforms the naive MSM implementation. At this input size, the performance of single-threaded and multi-threaded WASM drops off a cliff (~2x slower than the GPU variants). Interestingly, multi-threaded WASM (using web workers) and single-threaded WASM perform about the same here. Preliminary results indicate that on an M2 Max, the multi-threaded variant is about 40% faster than the single-threaded one. Intuitively it should be faster, so perhaps there is an issue with the multithreading in terms of the level at which tasks are delegated. Overall, web workers won't give you the speedup you might expect, because communication between web workers requires message passing, which usually incurs some serialization/deserialization cost.
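To put a number on "perform about the same", here is a minimal sketch that computes the web-worker speedup from the three runs at 2^16 inputs reported in the tables in this post. The helper names `mean` and `speedup` are made up for illustration and are not part of the webgpu-msm repository.

```typescript
// Timings in milliseconds for 2^16 inputs, copied from the table in this post.
const singleThreadMs = [46724, 46675, 47138];
const webWorkersMs = [46036, 45542, 45194];

// Arithmetic mean of a list of timings.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Speedup > 1 means the candidate (web workers) is faster than the baseline.
function speedup(baselineMs: number[], candidateMs: number[]): number {
  return mean(baselineMs) / mean(candidateMs);
}

console.log(speedup(singleThreadMs, webWorkersMs).toFixed(3));
```

On the M1 numbers this comes out to a gain of roughly 3%, far below the ~40% reported for the M2 Max, which is consistent with messaging overhead eating most of the parallel gain on this machine.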
It's worth noting that the input size is very important for the choice between GPU and CPU, as well as for the particular algorithm used. Pippenger in particular has a large overhead, which is amortized for larger inputs, but it's possible Pippenger could be faster in every case as long as the parameters are set correctly. Demox-Labs currently uses the same 2^16 bucket size regardless of input size. It would make sense to me that choosing a bucket size based on the input size would lead to better results, particularly for smaller inputs.
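A rough way to see why a fixed bucket size could hurt small inputs is the textbook cost model for the bucket method: with n inputs, b-bit scalars and a window of c bits (so 2^c buckets per window), the work is about ceil(b / c) * (n + 2^c) group additions. The sketch below is not Demox-Labs' code. It assumes that "2^16 bucket size" corresponds to c = 16, uses a 256-bit scalar width as a placeholder, and the function names and search range are hypothetical.

```typescript
// Approximate number of group additions for the bucket method:
// ceil(b / c) windows, each costing roughly n bucket insertions
// plus 2^c additions to reduce the buckets.
function pippengerCost(numInputs: number, scalarBits: number, windowBits: number): number {
  const windows = Math.ceil(scalarBits / windowBits);
  return windows * (numInputs + 2 ** windowBits);
}

// Brute-force search for the window width with the lowest modelled cost.
// maxWindowBits = 24 is an arbitrary upper bound chosen for this sketch.
function bestWindowBits(numInputs: number, scalarBits: number, maxWindowBits = 24): number {
  let best = 1;
  for (let c = 2; c <= maxWindowBits; c++) {
    if (pippengerCost(numInputs, scalarBits, c) < pippengerCost(numInputs, scalarBits, best)) {
      best = c;
    }
  }
  return best;
}

for (const log2n of [10, 14, 16, 20]) {
  const n = 2 ** log2n;
  const c = bestWindowBits(n, 256);
  const penalty = pippengerCost(n, 256, 16) / pippengerCost(n, 256, c);
  console.log(`2^${log2n} inputs: modelled best c = ${c}, fixed c = 16 costs ${penalty.toFixed(1)}x`);
}
```

Under this model the best window for 2^10 inputs is c = 8, while c = 16 only becomes the best choice around 2^20 inputs. Real GPU implementations have other costs (memory traffic, dispatch overhead), so treat this as a starting point for tuning rather than a prediction.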
Moreover, while the execution time of the Naive WebGPU MSM implementation scales roughly linearly with the number of inputs, that of the Pippenger WebGPU MSM does not. Pippenger is also not expected to achieve a near-linear speedup in execution time, that is, a speedup ratio equal to the number of execution threads available on the GPU.
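The scaling behaviour can be checked against the Run 1 timings in the tables below. This sketch estimates an empirical scaling exponent k, assuming run time grows like n^k between 2^16 and 2^20 inputs. The `Sample` type and `scalingExponent` helper are illustrative names, not part of the benchmark code.

```typescript
// One measurement: log2 of the input count and the Run 1 time in milliseconds.
interface Sample {
  log2Inputs: number;
  ms: number;
}

// If time ~ n^k, then k = log2(t_last / t_first) / (log2 n_last - log2 n_first).
function scalingExponent(first: Sample, last: Sample): number {
  return Math.log2(last.ms / first.ms) / (last.log2Inputs - first.log2Inputs);
}

// Run 1 values copied from the 2^16 and 2^20 tables in this post.
const naiveK = scalingExponent({ log2Inputs: 16, ms: 25438 }, { log2Inputs: 20, ms: 332374 });
const pippengerK = scalingExponent({ log2Inputs: 16, ms: 24592 }, { log2Inputs: 20, ms: 99528 });

console.log(`naive exponent: ${naiveK.toFixed(2)}, Pippenger exponent: ${pippengerK.toFixed(2)}`);
```

An exponent near 1 means time doubles when the input doubles. On these numbers the naive implementation lands close to 1 and Pippenger close to 0.5, though with a single run per size at the largest inputs this is only indicative.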
Number of inputs: 2^10
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 576 | 560 | 553 |
Naive WebGPU MSM | 792 | 790 | 782 |
Aleo Wasm: Single Thread | 606 | 570 | 537 |
Aleo Wasm: Web Workers | 1142 | 780 | 711 |
Number of inputs: 2^11
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 1096 | 1082 | 1081 |
Naive WebGPU MSM | 952 | 955 | 942 |
Aleo Wasm: Single Thread | 1141 | 1078 | 1085 |
Aleo Wasm: Web Workers | 1239 | 1206 | 1211 |
Number of inputs: 2^12
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 2074 | 1994 | 2028 |
Naive WebGPU MSM | 1812 | 1837 | 1820 |
Aleo Wasm: Single Thread | 2177 | 2102 | 2087 |
Aleo Wasm: Web Workers | 2269 | 2197 | 2189 |
Number of inputs: 2^13
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 4065 | 4005 | 3913 |
Naive WebGPU MSM | 2966 | 2904 | 2931 |
Aleo Wasm: Single Thread | 4197 | 4154 | 4123 |
Aleo Wasm: Web Workers | 4167 | 4132 | 4113 |
Number of inputs: 2^14
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 7567 | 7296 | 7371 |
Naive WebGPU MSM | 5663 | 5680 | 5654 |
Aleo Wasm: Single Thread | 8223 | 8216 | 8205 |
Aleo Wasm: Web Workers | 8222 | 8074 | 8019 |
Number of inputs: 2^15
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 13580 | 14274 | 13880 |
Naive WebGPU MSM | 10793 | 10779 | 10747 |
Aleo Wasm: Single Thread | 16290 | 16337 | 16291 |
Aleo Wasm: Web Workers | 15649 | 15596 | 15538 |
Number of inputs: 2^16
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 24592 | 23538 | 23584 |
Naive WebGPU MSM | 25438 | 25583 | 25406 |
Aleo Wasm: Single Thread | 46724 | 46675 | 47138 |
Aleo Wasm: Web Workers | 46036 | 45542 | 45194 |
Number of inputs: 2^17
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 36926 | 35647 | 34954 |
Naive WebGPU MSM | 50429 | 50616 | 50652 |
Aleo Wasm: Single Thread | 93764 | 92801 | 93489 |
Aleo Wasm: Web Workers | 90028 | 90421 | 90618 |
Number of inputs: 2^18
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 47198 | 46529 | 46272 |
Naive WebGPU MSM | 83084 | 83600 | 83167 |
Aleo Wasm: Single Thread | 173216 | - | - |
Aleo Wasm: Web Workers | Timeout | - | - |
Number of inputs: 2^19
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 63482 | - | - |
Naive WebGPU MSM | 165990 | - | - |
Aleo Wasm: Single Thread | 346806 | - | - |
Aleo Wasm: Web Workers | Timeout | - | - |
Number of inputs: 2^20
Test name | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Pippenger WebGPU MSM | 99528 | - | - |
Naive WebGPU MSM | 332374 | - | - |
Aleo Wasm: Single Thread | 695256 | - | - |
Aleo Wasm: Web Workers | Timeout | - | - |
Baseline Pippenger's Bucket Method MSM performance for the webgpu-msm reference repository from Demox-Labs.