Closed weijiekoh closed 8 months ago
This PR works with the cloud Mac Minis!
MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl 1st) | Average (excl 1st) |
---|---|---|---|---|---|---|---|---|
2^16 | 9242 |
1422 |
1382 |
1374 |
1371 |
2659 |
2908 |
1642 |
2^17 | 1942 |
1770 |
3086 |
1794 |
1777 |
3045 |
2236 |
2294 |
2^18 | 2663 |
3948 |
3940 |
3926 |
2672 |
3922 |
3512 |
3682 |
2^19 | 5499 |
5599 |
6002 |
5577 |
5570 |
5636 |
5647 |
5677 |
2^20 | 10310 |
10394 |
10505 |
10415 |
10468 |
10403 |
10416 |
10437 |
Note that shaders were forced to recompile only for MSM size 2^16
MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl 1st) | Average (excl 1st) |
---|---|---|---|---|---|---|---|---|
2^16 | 7661 |
1153 |
1146 |
1138 |
1362 |
1140 |
2267 |
1188 |
2^17 | 1634 |
1683 |
1495 |
1477 |
1620 |
1480 |
1565 |
1551 |
2^18 | 2295 |
2305 |
2155 |
2304 |
2362 |
2349 |
2295 |
2295 |
2^19 | 3652 |
3705 |
3623 |
3635 |
3645 |
3635 |
3649 |
3649 |
2^20 | 6557 |
6478 |
6457 |
6453 |
6485 |
6443 |
6479 |
6463 |
Note that shaders were forced to recompile only for MSM size 2^16
This looks great! I recalled a note from the first time we met Evan: Increasing the depth of the WGSL shader code decreases the overall speedup.
It seems like for 2^16 inputs
on M1 on v123
, splitting the shader results in slightly longer compilation time, but the overall runtime is faster!
2^16:
first run: 3600 ms
second run: 1195 ms
2^17:
first run: 1559 ms
second run: 1568 ms
2^18:
first run: 2272 ms
second run: 2279 ms
2^19:
first run: 3802 ms
second run: 3765 ms
2^20:
first run: 7817 ms
second run: 7774 ms
2^16:
first run: 4102 ms
second run: 1116 ms
2^17:
first run: 1684 ms
second run: 1501 ms
2^18:
first run: 2198 ms
second run: 2310 ms
2^19:
first run: 3697 ms
second run: 3870 ms
2^20:
first run: 7280 ms
second run: 7269 ms
@weijiekoh I tested this PR against chrome v115
(July, 2023) per #120 , and we need to discuss allowing v120 (December, 2023) for the competition (and I don't think it's too late given the deadline extension). Performance on v115
vs v123
on M1 and M3 is ~20% slower when the inputs get sufficiently larger passed 2^17
inputs. If we assume there are X competitors, and half target their solution for wasm, then using a new chrome version can only benefit webgpu implementations since our solution will be faster compared to the other wasm solutions. This is a consideration we forgot to take into account and should discuss.
This PR works for both M1 and M3 on both chrome v115 and v123
! This seems to resolve #104
I'm not seeing these slowdowns comparing v120 to v123
versus
comparing v115 to v120/v123 on both M1 and M3.
The indexing seems to be optimized and correct. LGTM!
References #121. The original BPR shader performed a series of point additions, followed by scalar multiplication. This resulted in a rather large shader that some platforms (Mac Minis - M1 2020 and M2 2023) could not run. By splitting the shader into two - where the first shader computes the
m
andg
points, and the second performs the final scalar multiplication and point addition, we hope to make it work on said platforms. Ideally, this would allow it to also work on the M3.The debug mode for the second part is still buggy, but the final result is correct. TBD.
Benchmarks
The tradeoff is that there is a small slowdown due to the overhead of an extra shader.
The initial compilation time, however, is 2s faster.
Performance of this PR (commit
90e820b
)21983
1285
1201
1213
1362
1245
4715
1261
1584
1795
1546
1595
1632
1546
1616
1623
2505
2456
2554
2437
2605
2519
2513
2514
3884
3959
3831
3955
3904
4009
3924
3932
6786
6871
6824
6806
6918
6811
6836
6846
Note that shaders were forced to recompile only for MSM size 2^16
Performance before splitting the BPR shader (commit
4495fbe6
)23779
1085
1181
1170
1170
1122
4918
1146
1543
2082
1410
1480
1422
1485
1570
1576
2412
2291
2392
2172
2374
2545
2364
2355
3856
3760
3714
3828
3730
3860
3791
3778
6665
6685
6636
6686
6572
6593
6640
6634
Note that shaders were forced to recompile only for MSM size 2^16