WIP: split the bucket points reduction (BPR) shader into 2

weijiekoh commented 9 months ago

References #121. The original BPR shader performed a series of point additions, followed by scalar multiplication. This resulted in a rather large shader that some platforms (Mac Minis - M1 2020 and M2 2023) could not run. By splitting the shader into two - where the first shader computes the m and g points, and the second performs the final scalar multiplication and point addition, we hope to make it work on said platforms. Ideally, this would allow it to also work on the M3.

The debug mode for the second part is still buggy, but the final result is correct. TBD.

Benchmarks

The tradeoff is that there is a small slowdown due to the overhead of an extra shader.

The initial compilation time, however, is 2s faster.

Performance of this PR (commit `90e820b`)

MSM size	1st run	Run 1	Run 2	Run 3	Run 4	Run 5	Average (incl 1st)	Average (excl 1st)
2^16	`21983`	`1285`	`1201`	`1213`	`1362`	`1245`	`4715`	`1261`
2^17	`1584`	`1795`	`1546`	`1595`	`1632`	`1546`	`1616`	`1623`
2^18	`2505`	`2456`	`2554`	`2437`	`2605`	`2519`	`2513`	`2514`
2^19	`3884`	`3959`	`3831`	`3955`	`3904`	`4009`	`3924`	`3932`
2^20	`6786`	`6871`	`6824`	`6806`	`6918`	`6811`	`6836`	`6846`

Note that shaders were forced to recompile only for MSM size 2^16

Performance before splitting the BPR shader (commit `4495fbe6`)

MSM size	1st run	Run 1	Run 2	Run 3	Run 4	Run 5	Average (incl 1st)	Average (excl 1st)
2^16	`23779`	`1085`	`1181`	`1170`	`1170`	`1122`	`4918`	`1146`
2^17	`1543`	`2082`	`1410`	`1480`	`1422`	`1485`	`1570`	`1576`
2^18	`2412`	`2291`	`2392`	`2172`	`2374`	`2545`	`2364`	`2355`
2^19	`3856`	`3760`	`3714`	`3828`	`3730`	`3860`	`3791`	`3778`
2^20	`6665`	`6685`	`6636`	`6686`	`6572`	`6593`	`6640`	`6634`

Note that shaders were forced to recompile only for MSM size 2^16

weijiekoh commented 9 months ago

This PR works with the cloud Mac Minis!

Mac Mini M1 2020

MSM size	1st run	Run 1	Run 2	Run 3	Run 4	Run 5	Average (incl 1st)	Average (excl 1st)
2^16	`9242`	`1422`	`1382`	`1374`	`1371`	`2659`	`2908`	`1642`
2^17	`1942`	`1770`	`3086`	`1794`	`1777`	`3045`	`2236`	`2294`
2^18	`2663`	`3948`	`3940`	`3926`	`2672`	`3922`	`3512`	`3682`
2^19	`5499`	`5599`	`6002`	`5577`	`5570`	`5636`	`5647`	`5677`
2^20	`10310`	`10394`	`10505`	`10415`	`10468`	`10403`	`10416`	`10437`

Note that shaders were forced to recompile only for MSM size 2^16

Mac Mini M2 2023

MSM size	1st run	Run 1	Run 2	Run 3	Run 4	Run 5	Average (incl 1st)	Average (excl 1st)
2^16	`7661`	`1153`	`1146`	`1138`	`1362`	`1140`	`2267`	`1188`
2^17	`1634`	`1683`	`1495`	`1477`	`1620`	`1480`	`1565`	`1551`
2^18	`2295`	`2305`	`2155`	`2304`	`2362`	`2349`	`2295`	`2295`
2^19	`3652`	`3705`	`3623`	`3635`	`3645`	`3635`	`3649`	`3649`
2^20	`6557`	`6478`	`6457`	`6453`	`6485`	`6443`	`6479`	`6463`

Note that shaders were forced to recompile only for MSM size 2^16

TalDerei commented 9 months ago

This looks great! I recalled a note from the first time we met Evan: Increasing the depth of the WGSL shader code decreases the overall speedup. It seems like for 2^16 inputs on M1 on v123, splitting the shader results in slightly longer compilation time, but the overall runtime is faster!

M1

Benchmarks BEFORE this PR (4495fbe6)

2^16:
first run: 3600 ms
second run: 1195 ms

2^17:
first run: 1559 ms
second run: 1568 ms

2^18:
first run: 2272 ms
second run: 2279 ms

2^19:
first run: 3802 ms
second run: 3765 ms

2^20:
first run: 7817 ms
second run: 7774 ms

Benchmarks FOR this PR (90e820b)

2^16:
first run:  4102 ms
second run: 1116 ms

2^17:
first run: 1684 ms
second run: 1501 ms

2^18:
first run: 2198 ms
second run: 2310 ms

2^19:
first run: 3697 ms
second run: 3870 ms

2^20:
first run: 7280 ms
second run: 7269 ms

TalDerei commented 9 months ago

@weijiekoh I tested this PR against chrome v115 (July, 2023) per #120 , and we need to discuss allowing v120 (December, 2023) for the competition (and I don't think it's too late given the deadline extension). Performance on v115 vs v123 on M1 and M3 is ~20% slower when the inputs get sufficiently larger passed 2^17 inputs. If we assume there are X competitors, and half target their solution for wasm, then using a new chrome version can only benefit webgpu implementations since our solution will be faster compared to the other wasm solutions. This is a consideration we forgot to take into account and should discuss.

TalDerei commented 9 months ago

This PR works for both M1 and M3 on both chrome v115 and v123! This seems to resolve #104

TalDerei commented 9 months ago

I'm not seeing these slowdowns comparing v120 to v123

versus

comparing v115 to v120/v123 on both M1 and M3.

TalDerei commented 9 months ago

The indexing seems to be optimized and correct. LGTM!

td-kwj-zp2023 / webgpu-msm-twisted-edwards