TalDerei closed this issue 8 months ago.
Open Questions:

- 2621520 x 16 bytes (upper limit 2621520 x 50 in theory?)

Shader Performance Improvements:

Currently, the shaders perform as follows on the M1:
- point conversion and scalar decomposition: 250ms
- [x] transpose: 16 x 80ms
  - Parallelize transpose using multiple threads. (WJ)
- [x] smvp: 14 x 70ms + 2 x 400ms
  - Figure out why some CSR matrices are slower. (WJ)
- [x] bucket aggregation: 16 x 550ms
  - Resolve the bottleneck. Bucket aggregation is the bottleneck of the whole computation, and reducing it will improve performance by the largest margin.
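For context, the bucket-aggregation step reduces each window's buckets B[1..m] to the weighted sum 1*B[1] + 2*B[2] + ... + m*B[m]. A minimal CPU sketch of the serial running-sum formulation, with plain integers standing in for curve points (names here are illustrative, not taken from the repo):

```typescript
// Running-sum bucket aggregation: computes sum_{i=1..m} i * B[i] using only
// additions, by accumulating a suffix sum from the highest bucket downward.
// Plain numbers stand in for curve points; a shader would use point addition.
function aggregateBuckets(buckets: number[]): number {
  let runningSum = 0; // suffix sum B[m] + B[m-1] + ... + B[i]
  let total = 0;      // sum of all suffix sums, which equals sum_i i * B[i]
  for (let i = buckets.length - 1; i >= 1; i--) {
    runningSum += buckets[i];
    total += runningSum;
  }
  return total; // buckets[0] is the identity bucket and contributes nothing
}
```

A parallel variant would split the bucket range into chunks, run this loop per chunk in separate invocations, and combine the partial results, which is one plausible way to attack the bottleneck noted above.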
- [x] Try workgroup_size(x, y, z) on various shaders to see if locality / memory addressing improves performance. https://surma.dev/things/webgpu/
- [x] (Approach 1: #89) Process multiple CSR matrices / compute shaders in parallel using web workers. Reference the buffer dependency chain diagram. (Tal)
- [x] (Approach 2: #90) Modify the indexing structure of SMVP and bucket aggregation to perform shader invocations in parallel. (WJ)
- [x] Make the scalar decomposition and point conversion shaders accept INPUT_SIZE via a uniform buffer to avoid recompilation (WJ) https://github.com/TalDerei/webgpu-msm/pull/99
- [x] Investigate Montgomery squaring (WJ + Tal) https://hackmd.io/@gnark/modular_multiplication#Montgomery-squaring
- [x] Investigate the performance of 16-bit limbs (WJ) #98
- [x] Since judges will always use input sizes that are powers of 2, the bucket sum shader doesn't need to check whether the number of inputs is odd and use the point at infinity if so. (WJ)
- [x] Get it to work on an old Nvidia card (WJ)
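The Montgomery-squaring item above builds on Montgomery multiplication, which replaces the expensive reduction mod p with shifts and masks by a power-of-two R. A minimal bigint sketch of the idea, with a toy modulus rather than the actual field, and illustrative function names (real shaders would work limb-by-limb, e.g. CIOS):

```typescript
// Montgomery reduction sketch with bigints. R = 2^nBits; operands are kept in
// Montgomery form aR mod p, and montMul(aR, bR) returns abR mod p.

// p^{-1} mod 2^bits via Newton iteration (p must be odd).
function invModPow2(p: bigint, bits: number): bigint {
  let inv = 1n; // correct mod 2^1 because p is odd
  for (let i = 1; i < bits; i *= 2) {
    // Each step doubles the number of correct low bits.
    inv = (inv * (2n - p * inv)) & ((1n << BigInt(2 * i)) - 1n);
  }
  return inv & ((1n << BigInt(bits)) - 1n);
}

// REDC: returns a * b * R^{-1} mod p for R = 2^nBits.
function montMul(a: bigint, b: bigint, p: bigint, nBits: number): bigint {
  const mask = (1n << BigInt(nBits)) - 1n;
  const pInv = ((1n << BigInt(nBits)) - invModPow2(p, nBits)) & mask; // -p^{-1} mod R
  const t = a * b;
  const m = ((t & mask) * pInv) & mask; // m = t * (-p^{-1}) mod R
  const u = (t + m * p) >> BigInt(nBits); // t + m*p is divisible by R
  return u >= p ? u - p : u;
}
```

Montgomery squaring is just montMul(a, a); the point of a dedicated squaring routine is that symmetric limb cross-products can be computed once and doubled (see the gnark note linked above).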
Benchmarking:

- Measure throughput % and memory utilization for WebGPU. Do this on a single CSR matrix, and on the whole end-to-end pipeline. Profile the Demox-Labs baseline and compare it against our repository. https://github.com/TalDerei/webgpu-msm/issues/37
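Each CSR matrix benchmarked above drives a sparse matrix-vector product (the smvp shader timed earlier). A CPU sketch of that per-matrix work, with integers standing in for curve points and illustrative names:

```typescript
// SMVP over a CSR matrix: row r sums the entries selected by colIdx between
// rowPtr[r] and rowPtr[r+1]. Integers stand in for curve points; the real
// shader would use point addition starting from the point at infinity.
function smvp(rowPtr: number[], colIdx: number[], points: number[]): number[] {
  const out: number[] = [];
  for (let r = 0; r + 1 < rowPtr.length; r++) {
    let acc = 0; // identity element
    for (let j = rowPtr[r]; j < rowPtr[r + 1]; j++) {
      acc += points[colIdx[j]];
    }
    out.push(acc);
  }
  return out;
}
```

Row lengths (rowPtr[r+1] - rowPtr[r]) vary between matrices, so per-thread work is uneven; that imbalance is one plausible reason some CSR matrices run slower than others, as flagged in the checklist.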
Housekeeping:

- Support 2^19 / 2^20 inputs by refactoring / splitting up the buffer layouts of convert_point_coords.template.wgsl and decompose_scalars.template.wgsl
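The decomposition that decompose_scalars.template.wgsl performs can be sketched on the CPU. Assuming 16-bit windows, in line with the 16-bit-limb investigation above, a 256-bit scalar splits into 16 chunks (the function name and parameters here are illustrative, not the shader's actual interface):

```typescript
// Split a scalar into fixed-width windows, least-significant first. With
// windowBits = 16 and numWindows = 16 this covers a 256-bit scalar; each
// chunk then indexes a bucket within its window's bucket set.
function decomposeScalar(scalar: bigint, windowBits: number, numWindows: number): number[] {
  const mask = (1n << BigInt(windowBits)) - 1n;
  const chunks: number[] = [];
  for (let w = 0; w < numWindows; w++) {
    chunks.push(Number((scalar >> BigInt(w * windowBits)) & mask));
  }
  return chunks;
}
```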