td-kwj-zp2023 / webgpu-msm-twisted-edwards

WebGPU MSM Implementation (Twisted Edwards Curve) for ZPrize 2023
7 stars 1 forks source link

WIP: split the bucket points reduction (BPR) shader into 2 #122

Closed weijiekoh closed 8 months ago

weijiekoh commented 9 months ago

References #121. The original BPR shader performed a series of point additions, followed by scalar multiplication. This resulted in a rather large shader that some platforms (Mac Minis - M1 2020 and M2 2023) could not run. By splitting the shader into two - where the first shader computes the m and g points, and the second performs the final scalar multiplication and point addition, we hope to make it work on said platforms. Ideally, this would allow it to also work on the M3.

The debug mode for the second part is still buggy, but the final result is correct. TBD.

Benchmarks

The tradeoff is that there is a small slowdown due to the overhead of an extra shader.

The initial compilation time, however, is 2s faster.

Performance of this PR (commit 90e820b)

MSM size 1st run Run 1 Run 2 Run 3 Run 4 Run 5 Average (incl 1st) Average (excl 1st)
2^16 21983 1285 1201 1213 1362 1245 4715 1261
2^17 1584 1795 1546 1595 1632 1546 1616 1623
2^18 2505 2456 2554 2437 2605 2519 2513 2514
2^19 3884 3959 3831 3955 3904 4009 3924 3932
2^20 6786 6871 6824 6806 6918 6811 6836 6846

Note that shaders were forced to recompile only for MSM size 2^16

Performance before splitting the BPR shader (commit 4495fbe6)

MSM size 1st run Run 1 Run 2 Run 3 Run 4 Run 5 Average (incl 1st) Average (excl 1st)
2^16 23779 1085 1181 1170 1170 1122 4918 1146
2^17 1543 2082 1410 1480 1422 1485 1570 1576
2^18 2412 2291 2392 2172 2374 2545 2364 2355
2^19 3856 3760 3714 3828 3730 3860 3791 3778
2^20 6665 6685 6636 6686 6572 6593 6640 6634

Note that shaders were forced to recompile only for MSM size 2^16

weijiekoh commented 9 months ago

This PR works with the cloud Mac Minis!

Mac Mini M1 2020

MSM size 1st run Run 1 Run 2 Run 3 Run 4 Run 5 Average (incl 1st) Average (excl 1st)
2^16 9242 1422 1382 1374 1371 2659 2908 1642
2^17 1942 1770 3086 1794 1777 3045 2236 2294
2^18 2663 3948 3940 3926 2672 3922 3512 3682
2^19 5499 5599 6002 5577 5570 5636 5647 5677
2^20 10310 10394 10505 10415 10468 10403 10416 10437

Note that shaders were forced to recompile only for MSM size 2^16

Mac Mini M2 2023

MSM size 1st run Run 1 Run 2 Run 3 Run 4 Run 5 Average (incl 1st) Average (excl 1st)
2^16 7661 1153 1146 1138 1362 1140 2267 1188
2^17 1634 1683 1495 1477 1620 1480 1565 1551
2^18 2295 2305 2155 2304 2362 2349 2295 2295
2^19 3652 3705 3623 3635 3645 3635 3649 3649
2^20 6557 6478 6457 6453 6485 6443 6479 6463

Note that shaders were forced to recompile only for MSM size 2^16

TalDerei commented 9 months ago

This looks great! I recalled a note from the first time we met Evan: Increasing the depth of the WGSL shader code decreases the overall speedup. It seems like for 2^16 inputs on M1 on v123, splitting the shader results in slightly longer compilation time, but the overall runtime is faster!

M1

Benchmarks BEFORE this PR (4495fbe6)

2^16:
first run: 3600 ms
second run: 1195 ms

2^17:
first run: 1559 ms
second run: 1568 ms

2^18:
first run: 2272 ms
second run: 2279 ms

2^19:
first run: 3802 ms
second run: 3765 ms

2^20:
first run: 7817 ms
second run: 7774 ms

Benchmarks FOR this PR (90e820b)

2^16:
first run:  4102 ms
second run: 1116 ms

2^17:
first run: 1684 ms
second run: 1501 ms

2^18:
first run: 2198 ms
second run: 2310 ms

2^19:
first run: 3697 ms
second run: 3870 ms

2^20:
first run: 7280 ms
second run: 7269 ms
TalDerei commented 9 months ago

@weijiekoh I tested this PR against chrome v115 (July, 2023) per #120 , and we need to discuss allowing v120 (December, 2023) for the competition (and I don't think it's too late given the deadline extension). Performance on v115 vs v123 on M1 and M3 is ~20% slower when the inputs get sufficiently larger passed 2^17 inputs. If we assume there are X competitors, and half target their solution for wasm, then using a new chrome version can only benefit webgpu implementations since our solution will be faster compared to the other wasm solutions. This is a consideration we forgot to take into account and should discuss.

TalDerei commented 9 months ago

This PR works for both M1 and M3 on both chrome v115 and v123! This seems to resolve #104

TalDerei commented 9 months ago

I'm not seeing these slowdowns comparing v120 to v123

versus

comparing v115 to v120/v123 on both M1 and M3.

TalDerei commented 9 months ago

The indexing seems to be optimized and correct. LGTM!