td-kwj-zp2023 / webgpu-msm-twisted-edwards

WebGPU MSM Implementation (Twisted Edwards Curve) for ZPrize 2023

Tweaks to the number of dispatched workgroups and other minor optimisations #103

Closed · weijiekoh closed this 9 months ago

weijiekoh commented 10 months ago

With these tweaks, a 2 ** 16 MSM now completes in under 1s on my machine about half of the time.

It seems that setting the workgroup_size to 256 is the best strategy because this reduces the number of workgroup dispatches.
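
As a minimal illustration (hypothetical numbers, not the repo's actual code), for a fixed total number of invocations, a larger workgroup_size directly shrinks the number of workgroups that need to be dispatched:

    // Hypothetical illustration: fewer workgroups are dispatched when each
    // workgroup contains more invocations.
    const total_invocations = 2 ** 16;
    const workgroups_at_64 = Math.ceil(total_invocations / 64);   // 1024 workgroups
    const workgroups_at_256 = Math.ceil(total_invocations / 256); // 256 workgroups
    console.log(workgroups_at_64, workgroups_at_256);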

TalDerei commented 10 months ago

Interesting changes! This seems to run about the same on my M1 as the current main branch. Isn't the general advice to use 64-sized workgroups?

Screenshot 2024-01-14 at 8 43 42 AM
TalDerei commented 10 months ago

I'm still seeing ~1.7s for 2^16 inputs. If your timings are sub-second about half the time, roughly what are they for the other half? And how is the overall performance with 64-sized workgroups?

weijiekoh commented 10 months ago

With full (256-sized) workgroups:

2 ** 16:

1034ms, 1150ms, 1114ms, 987ms

Full benchmarks (times in ms):

| MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl. 1st) | Average (excl. 1st) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2^16 | 23640 | 1212 | 1099 | 1020 | 1090 | 1174 | 4873 | 1119 |
| 2^17 | 1338 | 1461 | 1314 | 1496 | 1349 | 1450 | 1401 | 1414 |
| 2^18 | 2011 | 2028 | 2002 | 2055 | 2018 | 2019 | 2022 | 2024 |
| 2^19 | 3484 | 3574 | 3460 | 3595 | 3480 | 3521 | 3519 | 3526 |
| 2^20 | 6405 | 6529 | 6388 | 6433 | 6384 | 6712 | 6475 | 6489 |

With smaller workgroups (64 for point conversion, 64 for SMVP, 1 for bucket reduction):

1187ms, 1211ms, 1226ms, 1208ms

Full benchmarks (times in ms):

| MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl. 1st) | Average (excl. 1st) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2^16 | 23507 | 1117 | 1151 | 1186 | 1269 | 1155 | 4898 | 1176 |
| 2^17 | 1566 | 1813 | 1510 | 1566 | 1478 | 1674 | 1601 | 1608 |
| 2^18 | 2198 | 2164 | 2246 | 2170 | 2150 | 2270 | 2200 | 2200 |
| 2^19 | 3708 | 3682 | 3635 | 3848 | 3683 | 3676 | 3705 | 3705 |
| 2^20 | 6526 | 7390 | 6704 | 6853 | 6622 | 6629 | 6787 | 6840 |
weijiekoh commented 10 months ago

> Interesting changes! This seems to run about the same on my M1 as the current main branch. Isn't the general advice to use 64-sized workgroups?

Great question: the article does give this advice, but my benchmarks show otherwise. Could you benchmark the following:

TalDerei commented 9 months ago

> Could you benchmark the following:

I must have missed this comment! For some reason, the full benchmark suite wasn't working. Let's merge, but there are a few terms I wanted to clarify.

1. maxComputeWorkgroupsPerDimension defines the maximum value for each dimension of a dispatch(x, y, z) call, and defaults to 2^16 - 1 (65,535) workgroups per dimension. The theoretical maximum number of workgroups the API can launch in a single dispatch is therefore (2^16 - 1 in the x-dimension) * (2^16 - 1 in the y-dimension) * (2^16 - 1 in the z-dimension). For instance, dispatching 2^16 workgroups in the x-dimension throws “Dispatch workgroup count X (65536) exceeds max compute workgroups per dimension (65535).”, and dispatching 2^16 - 1 freezes my machine.

Similarly, in CUDA: "A grid as we have seen can be quite large (more than quintillions of threads: 9,444,444,733,164,249,676,800 threads maximum, exactly. No current CUDA GPU has the memory space or address space to support that number of elements in a dataset. That maximum number exceeds 2^64 by several orders of magnitude)." In the same way that a process has seemingly "full" access to the machine's memory via virtual memory addressing, the CUDA API offers seemingly "infinite" access to threads, independent of the underlying GPU's actual capabilities and limits.

2. maxComputeInvocationsPerWorkgroup defines the maximum value of the product of the workgroup_size x, y, and z dimensions, and defaults to 256. The maximum workgroup_size is 256 in the x-dimension, 256 in the y-dimension, and 64 in the z-dimension, but their product must not exceed maxComputeInvocationsPerWorkgroup. Based on this article, there seems to be a negligible difference between @workgroup_size(256), @workgroup_size(128, 2), and @workgroup_size(64, 2, 2). Both limits can also be queried at runtime, as sketched below.
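
For a quick sanity check, these limits can be read off the device at runtime. This is a minimal sketch using the standard WebGPU API (the adapter/device setup is generic, not the repo's code):

    // Minimal sketch: query the compute-related limits discussed above.
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
        throw new Error('WebGPU is not supported on this device');
    }
    const device = await adapter.requestDevice();

    // Max workgroups per dimension of a single dispatch (default 65535).
    console.log(device.limits.maxComputeWorkgroupsPerDimension);
    // Max product of the workgroup_size x, y, z dimensions (default 256).
    console.log(device.limits.maxComputeInvocationsPerWorkgroup);
    // Per-dimension workgroup_size maxima (defaults 256, 256, 64).
    console.log(device.limits.maxComputeWorkgroupSizeX);
    console.log(device.limits.maxComputeWorkgroupSizeY);
    console.log(device.limits.maxComputeWorkgroupSizeZ);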


The bulk of the current runtime occurs during the SMVP stage, where we launch 64 (x) * 2 (y) * 16 (z) = 2,048 workgroups with a workgroup_size of 256, i.e. 524,288 threads. Here's a comparison with a server-grade Nvidia GPU (an A10) for context:

Screenshot 2024-01-19 at 11 35 29 AM

There must be some internal queue in WebGPU, as you suggested, that executes those 524,288 threads in chunks, since the underlying GPU clearly cannot run that many threads concurrently.
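
For reference, the thread count quoted above is just the product of the dispatch dimensions and the workgroup size (figures taken from this thread, shown purely as arithmetic):

    // Arithmetic behind the 524,288-thread figure for the SMVP dispatch.
    const num_workgroups = 64 * 2 * 16;        // 2048 workgroups per dispatch
    const invocations_per_workgroup = 256;     // workgroup_size of 256
    console.log(num_workgroups * invocations_per_workgroup); // 524288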

Currently, we're doing

    await smvp_gpu(
        smvp_shader,
        s_num_x_workgroups,
        s_num_y_workgroups,
        s_num_z_workgroups,
        device,
        commandEncoder,
        num_subtasks,
        num_columns,
        input_size,
        chunk_size,
        all_csc_col_ptr_sb,
        point_x_sb,
        point_y_sb,
        all_csc_val_idxs_sb,
        bucket_sum_x_sb,
        bucket_sum_y_sb,
        bucket_sum_t_sb,
        bucket_sum_z_sb,
    )

which takes some amount of time X. Halving s_num_x_workgroups to s_num_x_workgroups / 2 reduces the execution time by about 500ms, so the number of threads launched is clearly the dominant performance factor. Running the smvp_gpu shader twice with s_num_x_workgroups / 2 seems to take about as long as a single invocation with s_num_x_workgroups workgroups.

This preliminary result suggests that splitting the execution of the SMVP / bucket reduction shader into multiple separate dispatches might yield similar overall performance.
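
As a rough sketch of what splitting the dispatch could look like at the raw WebGPU API level (hypothetical: pipeline, bindGroupFor, and the per-pass offset uniform are assumptions for illustration, not the repo's actual smvp_gpu implementation):

    // Hypothetical sketch: splitting one large SMVP dispatch into two smaller ones.
    // Assumes the shader reads a per-pass x-offset from a uniform and adds it to its
    // workgroup index, so the two halves cover the workload exactly once.
    const half_x = Math.ceil(s_num_x_workgroups / 2);
    const encoder = device.createCommandEncoder();

    for (let pass = 0; pass < 2; pass ++) {
        const passEncoder = encoder.beginComputePass();
        passEncoder.setPipeline(pipeline);
        // bindGroupFor(x_offset) is a placeholder that would bind a uniform buffer
        // holding the offset (0 for the first half, half_x for the second).
        passEncoder.setBindGroup(0, bindGroupFor(pass * half_x));
        passEncoder.dispatchWorkgroups(half_x, s_num_y_workgroups, s_num_z_workgroups);
        passEncoder.end();
    }

    device.queue.submit([encoder.finish()]);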