TalDerei closed this issue 8 months ago.
Open Questions:

- 2621520 x 16 bytes (upper limit 2621520 x 50 in theory?)

Shader Performance Improvements:

Currently, the shaders perform as follows on the M1:
- point conversion and scalar decomposition: 250ms
- [x] transpose: 16 x 80ms
  - Parallelize transpose using multiple threads. (WJ)
- [x] smvp: 14 x 70ms + 2 x 400ms
  - Figure out why some CSR matrices are slower. (WJ)
- [x] bucket aggregation: 16 x 550ms
  - Resolve the bottleneck. Bucket aggregation is the bottleneck of the whole computation, and reducing it will improve performance by the largest margin.
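For context, the bucket-aggregation step reduces each window's buckets B[1..m] to the weighted sum 1*B[1] + 2*B[2] + ... + m*B[m]. A minimal CPU sketch of the serial running-sum formulation, with plain integers standing in for curve points (names here are illustrative, not taken from the repo):

```typescript
// Running-sum bucket aggregation: computes sum_{i=1..m} i * B[i] using only
// additions, by accumulating a suffix sum from the highest bucket downward.
// Plain numbers stand in for curve points; a shader would use point addition.
function aggregateBuckets(buckets: number[]): number {
  let runningSum = 0; // suffix sum B[m] + B[m-1] + ... + B[i]
  let total = 0;      // sum of all suffix sums, which equals sum_i i * B[i]
  for (let i = buckets.length - 1; i >= 1; i--) {
    runningSum += buckets[i];
    total += runningSum;
  }
  return total; // buckets[0] is the identity bucket and contributes nothing
}
```

A parallel variant would split the bucket range into chunks, run this loop per chunk in separate invocations, and combine the partial results, which is one plausible way to attack the bottleneck noted above.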
- [x] Try workgroup_size(x, y, z) on various shaders to see if locality / memory addressing improves performance. https://surma.dev/things/webgpu/
- [x] (Approach 1: #89) Process multiple CSR matrices / compute shaders in parallel using web workers. Reference the buffer dependency chain diagram. (Tal)
- [x] (Approach 2: #90) Modify the indexing structure of SMVP and bucket aggregation to perform shader invocations in parallel. (WJ)
- [x] Make the scalar decomposition and point conversion shaders accept INPUT_SIZE via a uniform buffer to avoid recompilation (WJ) https://github.com/TalDerei/webgpu-msm/pull/99
- [x] Investigate Montgomery squaring (WJ + Tal) https://hackmd.io/@gnark/modular_multiplication#Montgomery-squaring
- [x] Investigate the performance of 16-bit limbs (WJ) #98
- [x] Since judges will always use input sizes that are powers of 2, the bucket sum shader doesn't need to check whether the number of inputs is odd and use the point at infinity if so. (WJ)
- [x] Get it to work on an old Nvidia card (WJ)
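The Montgomery-squaring item above builds on Montgomery multiplication, which replaces the expensive reduction mod p with shifts and masks by a power-of-two R. A minimal bigint sketch of the idea, with a toy modulus rather than the actual field, and illustrative function names (real shaders would work limb-by-limb, e.g. CIOS):

```typescript
// Montgomery reduction sketch with bigints. R = 2^nBits; operands are kept in
// Montgomery form aR mod p, and montMul(aR, bR) returns abR mod p.

// p^{-1} mod 2^bits via Newton iteration (p must be odd).
function invModPow2(p: bigint, bits: number): bigint {
  let inv = 1n; // correct mod 2^1 because p is odd
  for (let i = 1; i < bits; i *= 2) {
    // Each step doubles the number of correct low bits.
    inv = (inv * (2n - p * inv)) & ((1n << BigInt(2 * i)) - 1n);
  }
  return inv & ((1n << BigInt(bits)) - 1n);
}

// REDC: returns a * b * R^{-1} mod p for R = 2^nBits.
function montMul(a: bigint, b: bigint, p: bigint, nBits: number): bigint {
  const mask = (1n << BigInt(nBits)) - 1n;
  const pInv = ((1n << BigInt(nBits)) - invModPow2(p, nBits)) & mask; // -p^{-1} mod R
  const t = a * b;
  const m = ((t & mask) * pInv) & mask; // m = t * (-p^{-1}) mod R
  const u = (t + m * p) >> BigInt(nBits); // t + m*p is divisible by R
  return u >= p ? u - p : u;
}
```

Montgomery squaring is just montMul(a, a); the point of a dedicated squaring routine is that symmetric limb cross-products can be computed once and doubled (see the gnark note linked above).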
Benchmarking:

- Measure throughput % and memory utilization for WebGPU. Do this on a single CSR matrix, and on the whole end-to-end pipeline. Profile the Demox-Labs baseline and compare it against our repository. https://github.com/TalDerei/webgpu-msm/issues/37
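Each CSR matrix benchmarked above drives a sparse matrix-vector product (the smvp shader timed earlier). A CPU sketch of that per-matrix work, with integers standing in for curve points and illustrative names:

```typescript
// SMVP over a CSR matrix: row r sums the entries selected by colIdx between
// rowPtr[r] and rowPtr[r+1]. Integers stand in for curve points; the real
// shader would use point addition starting from the point at infinity.
function smvp(rowPtr: number[], colIdx: number[], points: number[]): number[] {
  const out: number[] = [];
  for (let r = 0; r + 1 < rowPtr.length; r++) {
    let acc = 0; // identity element
    for (let j = rowPtr[r]; j < rowPtr[r + 1]; j++) {
      acc += points[colIdx[j]];
    }
    out.push(acc);
  }
  return out;
}
```

Row lengths (rowPtr[r+1] - rowPtr[r]) vary between matrices, so per-thread work is uneven; that imbalance is one plausible reason some CSR matrices run slower than others, as flagged in the checklist.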
Housekeeping:

- Support 2^19 / 2^20 inputs by refactoring / splitting up the buffer layouts of convert_point_coords.template.wgsl and decompose_scalars.template.wgsl
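The decomposition that decompose_scalars.template.wgsl performs can be sketched on the CPU. Assuming 16-bit windows, in line with the 16-bit-limb investigation above, a 256-bit scalar splits into 16 chunks (the function name and parameters here are illustrative, not the shader's actual interface):

```typescript
// Split a scalar into fixed-width windows, least-significant first. With
// windowBits = 16 and numWindows = 16 this covers a 256-bit scalar; each
// chunk then indexes a bucket within its window's bucket set.
function decomposeScalar(scalar: bigint, windowBits: number, numWindows: number): number[] {
  const mask = (1n << BigInt(windowBits)) - 1n;
  const chunks: number[] = [];
  for (let w = 0; w < numWindows; w++) {
    chunks.push(Number((scalar >> BigInt(w * windowBits)) & mask));
  }
  return chunks;
}
```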