penumbra-zone / webgpu

WebGPU-based Groth16 prover to accelerate client-side proof generation for the Penumbra Protocol

WebGPU baseline benchmarks #3

Closed TalDerei closed 1 year ago

TalDerei commented 1 year ago

Baseline Pippenger's Bucket Method MSM performance for the webgpu-msm reference repository from Demox-Labs.
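For context, the bucket method splits each b-bit scalar into windows of c bits, accumulates points into 2^c − 1 buckets per window, and recombines the buckets with a running-sum plus Horner step. Below is a minimal illustrative sketch (not the webgpu-msm code); to keep it self-contained and runnable, the "group" is plain BigInt addition rather than elliptic-curve point addition.

```typescript
// Sketch of Pippenger's bucket method. In a real MSM, `add`/`double`
// would be curve-point operations; BigInt arithmetic stands in here.
type Point = bigint;
const ZERO: Point = 0n;
const add = (a: Point, b: Point): Point => a + b;
const double = (a: Point): Point => a + a;

// Computes sum_i scalars[i] * points[i] with a window size of c bits.
function msmPippenger(scalars: bigint[], points: Point[], c: number, bits = 256): Point {
  const numWindows = Math.ceil(bits / c);
  const mask = (1n << BigInt(c)) - 1n;
  const windowSums: Point[] = [];

  for (let w = 0; w < numWindows; w++) {
    // One bucket per non-zero digit value in this window.
    const buckets: Point[] = new Array((1 << c) - 1).fill(ZERO);
    for (let i = 0; i < scalars.length; i++) {
      const digit = Number((scalars[i] >> BigInt(w * c)) & mask);
      if (digit !== 0) buckets[digit - 1] = add(buckets[digit - 1], points[i]);
    }
    // Running-sum trick: sum_j (j + 1) * buckets[j] in ~2 * 2^c additions.
    let running = ZERO;
    let acc = ZERO;
    for (let j = buckets.length - 1; j >= 0; j--) {
      running = add(running, buckets[j]);
      acc = add(acc, running);
    }
    windowSums.push(acc);
  }

  // Horner recombination: result = sum_w 2^(w*c) * windowSums[w].
  let result = ZERO;
  for (let w = numWindows - 1; w >= 0; w--) {
    for (let d = 0; d < c; d++) result = double(result);
    result = add(result, windowSums[w]);
  }
  return result;
}

// Example: msmPippenger([3n, 5n], [7n, 11n], 4, 8) === 3n * 7n + 5n * 11n === 76n
```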

TalDerei commented 1 year ago

Apple Mac M1 (2020) with 8 CPU cores and 8 GPU cores on macOS 12 (update: macOS 14 (Sonoma) yields a 25% performance gain). The results are measured in milliseconds. The memory saver and hardware acceleration options were enabled in the Chrome browser. An Nvidia 3090 Ti (Linux, Ubuntu 22.04) was also tested, but WebGPU is highly experimental on Linux builds: it requires an unstable dev browser (Chrome Canary is unsupported on Linux) and extra experimental flags specific to that dev browser. Consequently, the performance was highly unstable.
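As a sanity check before benchmarking, the harness can confirm that the browser actually exposes a usable WebGPU adapter. The sketch below is illustrative, not part of the reference repository, and assumes the `@webgpu/types` ambient declarations for `navigator.gpu`.

```typescript
// Illustrative sketch: verify WebGPU availability before running benchmarks.
async function assertWebGpuAvailable(): Promise<GPUDevice> {
  if (!("gpu" in navigator)) {
    throw new Error("WebGPU not exposed by this browser (navigator.gpu missing)");
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter === null) {
    // On Linux this often means the experimental browser flags are not enabled.
    throw new Error("No suitable GPU adapter available");
  }
  // Resolves to a usable GPUDevice or rejects.
  return adapter.requestDevice();
}
```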

webgpu-msm

Analysis

Setting the number of inputs too low (under 2^10) won't generate meaningful results. The lower the number of inputs, the faster the single-threaded WASM will be, and the larger the performance variance between runs. As the input size increases slightly, the multi-threaded WASM (web workers) outperforms the single-threaded WASM. When the input size grows sufficiently large (2^14), the performance gap between multi-threaded WASM and WebGPU closes, and WebGPU eventually overtakes both the naive and WASM implementations.

At 2^16 constraints, the Pippenger MSM implementation finally outperforms the naive MSM implementation. At this input size, the performance of the single- and multi-threaded WASM is similar and drops off a cliff (~2x slower than the GPU variants). Interestingly, the multi-threaded WASM (using web workers) and the single-threaded WASM perform the same. Preliminary results indicate that on an M2 Max, the multi-threaded variant is about 40% faster than the single-threaded one. Intuitively it should be faster here as well; perhaps there's an issue with the multi-threading in terms of the level at which tasks are delegated. Overall, web workers won't give you the speedup you might expect, since communication between web workers requires message passing, which usually incurs some serialization/deserialization cost.
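For a concrete sense of where that cost comes from, here is a minimal sketch (the `msmWorker.ts` module and the message shape are hypothetical, not the Aleo WASM harness): every `postMessage` either structured-clones its payload into the worker's heap or transfers buffer ownership, and partial results still have to be messaged back and aggregated on the main thread.

```typescript
// Hypothetical worker module; the chunk layout is illustrative only.
const worker = new Worker(new URL("./msmWorker.ts", import.meta.url), { type: "module" });

// Structured clone: the whole chunk is copied into the worker's heap.
function sendCopied(chunk: ArrayBuffer): void {
  worker.postMessage({ kind: "msm-chunk", chunk });
}

// Transferable: the buffer moves instead of being copied, but the sender
// loses access to it and per-message overhead still remains.
function sendTransferred(chunk: ArrayBuffer): void {
  worker.postMessage({ kind: "msm-chunk", chunk }, [chunk]);
}

worker.onmessage = (e: MessageEvent<{ partialSum: Uint8Array }>) => {
  // Partial results come back as messages and are aggregated here.
  console.log("partial MSM result:", e.data.partialSum.byteLength, "bytes");
};
```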

It's worth noting that the input size matters a great deal for the choice between GPU and CPU, as well as for the particular algorithm used. Pippenger in particular has a large overhead, which is amortized over larger inputs, but it's possible Pippenger could be faster in every case as long as the parameters are set correctly. Demox-Labs currently uses the same 2^16 bucket size regardless of input size. It would make sense that choosing a bucket size based on the input would lead to better results, particularly for smaller inputs.
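To illustrate that point, the sketch below picks the window size `c` by minimizing the usual Pippenger cost model of roughly `ceil(bits / c) * (n + 2^(c+1))` point additions, instead of hard-coding a 2^16-bucket window. The cost model and the 256-bit default are assumptions for illustration, not the Demox-Labs heuristic.

```typescript
// Choose a Pippenger window size based on the number of inputs n.
function chooseWindowSize(n: number, bits = 256): number {
  let best = 1;
  let bestCost = Infinity;
  for (let c = 1; c <= 24; c++) {
    // Approximate point additions: one pass per window over n inputs,
    // plus ~2 * 2^c additions to collapse the buckets of each window.
    const cost = Math.ceil(bits / c) * (n + 2 ** (c + 1));
    if (cost < bestCost) {
      bestCost = cost;
      best = c;
    }
  }
  return best;
}

// e.g. chooseWindowSize(2 ** 10) === 7 and chooseWindowSize(2 ** 20) === 16:
// a 2^16-bucket window only pays off once the input is large enough to
// amortize the bucket setup.
```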

Moreover, while the Naive WebGPU MSM implementation scales linearly in runtime as the number of inputs increases, the Pippenger WebGPU MSM does not. Pippenger is also not expected to achieve a near-linear speedup in execution time, i.e., a speedup ratio equal to the number of execution threads available on the GPU.

Benchmarks

Number of inputs: 2^10

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 576 | 560 | 553 |
| Naive WebGPU MSM | 792 | 790 | 782 |
| Aleo Wasm: Single Thread | 606 | 570 | 537 |
| Aleo Wasm: Web Workers | 1142 | 780 | 711 |

Number of inputs: 2^11

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 1096 | 1082 | 1081 |
| Naive WebGPU MSM | 952 | 955 | 942 |
| Aleo Wasm: Single Thread | 1141 | 1078 | 1085 |
| Aleo Wasm: Web Workers | 1239 | 1206 | 1211 |

Number of inputs: 2^12

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 2074 | 1994 | 2028 |
| Naive WebGPU MSM | 1812 | 1837 | 1820 |
| Aleo Wasm: Single Thread | 2177 | 2102 | 2087 |
| Aleo Wasm: Web Workers | 2269 | 2197 | 2189 |

Number of inputs: 2^13

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 4065 | 4005 | 3913 |
| Naive WebGPU MSM | 2966 | 2904 | 2931 |
| Aleo Wasm: Single Thread | 4197 | 4154 | 4123 |
| Aleo Wasm: Web Workers | 4167 | 4132 | 4113 |

Number of inputs: 2^14

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 7567 | 7296 | 7371 |
| Naive WebGPU MSM | 5663 | 5680 | 5654 |
| Aleo Wasm: Single Thread | 8223 | 8216 | 8205 |
| Aleo Wasm: Web Workers | 8222 | 8074 | 8019 |

Number of inputs: 2^15

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 13580 | 14274 | 13880 |
| Naive WebGPU MSM | 10793 | 10779 | 10747 |
| Aleo Wasm: Single Thread | 16290 | 16337 | 16291 |
| Aleo Wasm: Web Workers | 15649 | 15596 | 15538 |

Number of inputs: 2^16

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 24592 | 23538 | 23584 |
| Naive WebGPU MSM | 25438 | 25583 | 25406 |
| Aleo Wasm: Single Thread | 46724 | 46675 | 47138 |
| Aleo Wasm: Web Workers | 46036 | 45542 | 45194 |

Number of inputs: 2^17

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 36926 | 35647 | 34954 |
| Naive WebGPU MSM | 50429 | 50616 | 50652 |
| Aleo Wasm: Single Thread | 93764 | 92801 | 93489 |
| Aleo Wasm: Web Workers | 90028 | 90421 | 90618 |

Number of inputs: 2^18

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 47198 | 46529 | 46272 |
| Naive WebGPU MSM | 83084 | 83600 | 83167 |
| Aleo Wasm: Single Thread | 173216 | - | - |
| Aleo Wasm: Web Workers | Timeout | - | - |

Number of inputs: 2^19

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 63482 | - | - |
| Naive WebGPU MSM | 165990 | - | - |
| Aleo Wasm: Single Thread | 346806 | - | - |
| Aleo Wasm: Web Workers | Timeout | - | - |

Number of inputs: 2^20

| Test name | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
| --- | --- | --- | --- |
| Pippenger WebGPU MSM | 99528 | - | - |
| Naive WebGPU MSM | 332374 | - | - |
| Aleo Wasm: Single Thread | 695256 | - | - |
| Aleo Wasm: Web Workers | Timeout | - | - |