td-kwj-zp2023 / webgpu-msm-twisted-edwards

WebGPU MSM Implementation (Twisted Edwards Curve) for ZPrize 2023

Tweaks to the number of dispatched workgroups and other minor optimisations #103

Closed · weijiekoh closed this 9 months ago

weijiekoh commented 10 months ago

With these tweaks, a 2 ** 16 MSM now completes in under 1s on my machine about half of the time.

It seems that setting the workgroup_size to 256 is the best strategy because this reduces the number of workgroup dispatches.
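
As a minimal illustration (hypothetical numbers, not the repo's actual code), for a fixed total number of invocations, a larger workgroup_size directly shrinks the number of workgroups that need to be dispatched:

    // Hypothetical illustration: fewer workgroups are dispatched when each
    // workgroup contains more invocations.
    const total_invocations = 2 ** 16;
    const workgroups_at_64 = Math.ceil(total_invocations / 64);   // 1024 workgroups
    const workgroups_at_256 = Math.ceil(total_invocations / 256); // 256 workgroups
    console.log(workgroups_at_64, workgroups_at_256);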

TalDerei commented 10 months ago

Interesting changes! This seems to run about the same on my M1 as the current main branch. Isn't the general advice to use 64-sized workgroups?

Screenshot 2024-01-14 at 8 43 42 AM
TalDerei commented 10 months ago

I'm still seeing ~1.7s for 2^16 inputs. If your timings are sub-second about half the time, roughly what are they for the other half? And how is the overall performance with 64-sized workgroups?

weijiekoh commented 10 months ago

With full (256-sized) workgroups:

2 ** 16:

1034ms, 1150ms, 1114ms, 987ms

Full benchmarks (times in ms):

| MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl. 1st) | Average (excl. 1st) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2^16 | 23640 | 1212 | 1099 | 1020 | 1090 | 1174 | 4873 | 1119 |
| 2^17 | 1338 | 1461 | 1314 | 1496 | 1349 | 1450 | 1401 | 1414 |
| 2^18 | 2011 | 2028 | 2002 | 2055 | 2018 | 2019 | 2022 | 2024 |
| 2^19 | 3484 | 3574 | 3460 | 3595 | 3480 | 3521 | 3519 | 3526 |
| 2^20 | 6405 | 6529 | 6388 | 6433 | 6384 | 6712 | 6475 | 6489 |

With smaller workgroups (64 for point conversion, 64 for SMVP, 1 for bucket reduction):

1187ms, 1211ms, 1226ms, 1208ms

Full benchmarks (times in ms):

| MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl. 1st) | Average (excl. 1st) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2^16 | 23507 | 1117 | 1151 | 1186 | 1269 | 1155 | 4898 | 1176 |
| 2^17 | 1566 | 1813 | 1510 | 1566 | 1478 | 1674 | 1601 | 1608 |
| 2^18 | 2198 | 2164 | 2246 | 2170 | 2150 | 2270 | 2200 | 2200 |
| 2^19 | 3708 | 3682 | 3635 | 3848 | 3683 | 3676 | 3705 | 3705 |
| 2^20 | 6526 | 7390 | 6704 | 6853 | 6622 | 6629 | 6787 | 6840 |
weijiekoh commented 10 months ago

> Interesting changes! This seems to run about the same on my M1 as the current main branch. Isn't the general advice to use 64-sized workgroups?

Great question: the article does give this advice, but my benchmarks show otherwise. Could you benchmark the following:

TalDerei commented 9 months ago

> Could you benchmark the following:

I must have missed this comment! For some reason, the full benchmark suite wasn't working. Let's merge, but there are a few terms I wanted to clarify.

1. maxComputeWorkgroupsPerDimension defines the maximum value for each dimension of a dispatch(x, y, z) call, and defaults to 2^16 - 1 (65,535) workgroups per dimension. The theoretical maximum number of workgroups the API can launch in a single dispatch is therefore (2^16 - 1 in the x-dimension) * (2^16 - 1 in the y-dimension) * (2^16 - 1 in the z-dimension). For instance, dispatching 2^16 workgroups in the x-dimension throws “Dispatch workgroup count X (65536) exceeds max compute workgroups per dimension (65535).”, and dispatching 2^16 - 1 freezes my machine.

Similarly, in CUDA: "A grid as we have seen can be quite large (more than quintillions of threads: 9,444,444,733,164,249,676,800 threads maximum, exactly. No current CUDA GPU has the memory space or address space to support that number of elements in a dataset. That maximum number exceeds 2^64 by several orders of magnitude)." In the same way that a process has seemingly "full" access to the machine's memory via virtual memory addressing, the CUDA API offers seemingly "infinite" access to threads, independent of the underlying GPU's actual capabilities and limits.

2. maxComputeInvocationsPerWorkgroup defines the maximum value of the product of the workgroup_size x, y, and z dimensions, and defaults to 256. The maximum workgroup_size is 256 in the x-dimension, 256 in the y-dimension, and 64 in the z-dimension, but their product must not exceed maxComputeInvocationsPerWorkgroup. Based on this article, there seems to be a negligible difference between @workgroup_size(256), @workgroup_size(128, 2), and @workgroup_size(64, 2, 2). Both limits can also be queried at runtime, as sketched below.
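
For a quick sanity check, these limits can be read off the device at runtime. This is a minimal sketch using the standard WebGPU API (the adapter/device setup is generic, not the repo's code):

    // Minimal sketch: query the compute-related limits discussed above.
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
        throw new Error('WebGPU is not supported on this device');
    }
    const device = await adapter.requestDevice();

    // Max workgroups per dimension of a single dispatch (default 65535).
    console.log(device.limits.maxComputeWorkgroupsPerDimension);
    // Max product of the workgroup_size x, y, z dimensions (default 256).
    console.log(device.limits.maxComputeInvocationsPerWorkgroup);
    // Per-dimension workgroup_size maxima (defaults 256, 256, 64).
    console.log(device.limits.maxComputeWorkgroupSizeX);
    console.log(device.limits.maxComputeWorkgroupSizeY);
    console.log(device.limits.maxComputeWorkgroupSizeZ);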


The bulk of the current runtime occurs during the SMVP stage, where we launch 64 (x) * 2 (y) * 16 (z) = 2,048 workgroups with a workgroup_size of 256, i.e. 524,288 threads. Here's a comparison with a server-grade Nvidia GPU (an A10) for context:

Screenshot 2024-01-19 at 11 35 29 AM

There must be some internal queue in WebGPU, as you suggested, that executes those 524,288 threads in chunks, since the underlying GPU clearly cannot run that many threads concurrently.
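
For reference, the thread count quoted above is just the product of the dispatch dimensions and the workgroup size (figures taken from this thread, shown purely as arithmetic):

    // Arithmetic behind the 524,288-thread figure for the SMVP dispatch.
    const num_workgroups = 64 * 2 * 16;        // 2048 workgroups per dispatch
    const invocations_per_workgroup = 256;     // workgroup_size of 256
    console.log(num_workgroups * invocations_per_workgroup); // 524288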

Currently, we're doing

    await smvp_gpu(
        smvp_shader,
        s_num_x_workgroups,
        s_num_y_workgroups,
        s_num_z_workgroups,
        device,
        commandEncoder,
        num_subtasks,
        num_columns,
        input_size,
        chunk_size,
        all_csc_col_ptr_sb,
        point_x_sb,
        point_y_sb,
        all_csc_val_idxs_sb,
        bucket_sum_x_sb,
        bucket_sum_y_sb,
        bucket_sum_t_sb,
        bucket_sum_z_sb,
    )

which takes some amount of time X. Halving s_num_x_workgroups to s_num_x_workgroups / 2 reduces the execution time by about 500ms, so the number of threads launched is clearly the dominant performance factor. Running the smvp_gpu shader twice with s_num_x_workgroups / 2 seems to take about as long as a single invocation with s_num_x_workgroups workgroups.

This preliminary result suggests that splitting the execution of the SMVP / bucket reduction shader into multiple separate dispatches might yield similar overall performance.
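
As a rough sketch of what splitting the dispatch could look like at the raw WebGPU API level (hypothetical: pipeline, bindGroupFor, and the per-pass offset uniform are assumptions for illustration, not the repo's actual smvp_gpu implementation):

    // Hypothetical sketch: splitting one large SMVP dispatch into two smaller ones.
    // Assumes the shader reads a per-pass x-offset from a uniform and adds it to its
    // workgroup index, so the two halves cover the workload exactly once.
    const half_x = Math.ceil(s_num_x_workgroups / 2);
    const encoder = device.createCommandEncoder();

    for (let pass = 0; pass < 2; pass ++) {
        const passEncoder = encoder.beginComputePass();
        passEncoder.setPipeline(pipeline);
        // bindGroupFor(x_offset) is a placeholder that would bind a uniform buffer
        // holding the offset (0 for the first half, half_x for the second).
        passEncoder.setBindGroup(0, bindGroupFor(pass * half_x));
        passEncoder.dispatchWorkgroups(half_x, s_num_y_workgroups, s_num_z_workgroups);
        passEncoder.end();
    }

    device.queue.submit([encoder.finish()]);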