Interesting changes! This seems to run about the same on my M1 as the current main branch. Isn't the general advice to use 64-sized workgroups? I'm still achieving ~1.7s for 2^16 inputs. If your timings are sub-second half of the time, roughly what are they the other half? And how's the general performance with 64-sized workgroups?
With full workgroups:

2^16: 1034ms, 1150ms, 1114ms, 987ms
Full benchmarks (all timings in ms):

MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl 1st) | Average (excl 1st)
---|---|---|---|---|---|---|---|---
2^16 | 23640 | 1212 | 1099 | 1020 | 1090 | 1174 | 4873 | 1119
2^17 | 1338 | 1461 | 1314 | 1496 | 1349 | 1450 | 1401 | 1414
2^18 | 2011 | 2028 | 2002 | 2055 | 2018 | 2019 | 2022 | 2024
2^19 | 3484 | 3574 | 3460 | 3595 | 3480 | 3521 | 3519 | 3526
2^20 | 6405 | 6529 | 6388 | 6433 | 6384 | 6712 | 6475 | 6489
With smaller workgroups (64 for point conversion, 64 for SMVP, 1 for bucket reduction):

2^16: 1187ms, 1211ms, 1226ms, 1208ms
Full benchmarks (all timings in ms):

MSM size | 1st run | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (incl 1st) | Average (excl 1st)
---|---|---|---|---|---|---|---|---
2^16 | 23507 | 1117 | 1151 | 1186 | 1269 | 1155 | 4898 | 1176
2^17 | 1566 | 1813 | 1510 | 1566 | 1478 | 1674 | 1601 | 1608
2^18 | 2198 | 2164 | 2246 | 2170 | 2150 | 2270 | 2200 | 2200
2^19 | 3708 | 3682 | 3635 | 3848 | 3683 | 3676 | 3705 | 3705
2^20 | 6526 | 7390 | 6704 | 6853 | 6622 | 6629 | 6787 | 6840
> Interesting changes! This seems to run about the same on my M1 as the current main branch. Isn't the general advice to use 64-sized workgroups?
Great question. The article does give this advice, but my benchmarks show otherwise. Could you benchmark the following:

- `44d38c6d76113499feaaeda3f7995a68e33b532a` in the branch `fix-num-workgroups`
- `ae0b24a37d1f77f813bcc336a5e8c56485c2a4ec` in the branch `smaller-workgroups`
> Could you benchmark the following:
I must have missed this comment! For some reason, the full benchmark suite wasn't working. Let's merge, but there are a few terms I wanted to clarify.
1. `maxComputeWorkgroupsPerDimension` defines the maximum value for each dimension of a `dispatch(x, y, z)` call, and defaults to 2^16 - 1 (65,535) workgroups per dimension. The theoretical maximum number of workgroups the API can launch in a single dispatch is therefore (2^16 - 1 in the x-dimension) × (2^16 - 1 in the y-dimension) × (2^16 - 1 in the z-dimension). For instance, dispatching 2^16 workgroups in the x-dimension throws "Dispatch workgroup count X (65536) exceeds max compute workgroups per dimension (65535)", and dispatching the allowed maximum of 2^16 - 1 freezes my machine. (See the sketch after this list for querying these limits at runtime.)
Similarly in CUDA, "A grid as we have seen can be quite large (more than quintillions of threads: 9,444,444,733,164,249,676,800 threads maximum, exactly. No current CUDA GPU has the memory space or address space to support that number of elements in a dataset. That maximum number exceeds 2^64 by several orders of magnitude)."
In the same way a process has seemingly "full" access to the machine's memory via virtual memory addressing, the CUDA API offers seemingly full and "infinite" access to threads, independent of the underlying GPU's capabilities/limits.
2. `maxComputeInvocationsPerWorkgroup` defines the maximum value of the product of the `workgroup_size` dimensions x, y, and z, and defaults to 256. The per-axis maximums are 256 for x, 256 for y, and 64 for z, but their product must still not exceed `maxComputeInvocationsPerWorkgroup`. Based on this article, there seems to be a negligible difference between `@workgroup_size(256)`, `@workgroup_size(128, 2)`, and `@workgroup_size(64, 2, 2)`.
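Here's a minimal sketch of how these two limits can be inspected at runtime using the standard WebGPU API (the setup code and function name are illustrative, not from this repo):

```ts
// Minimal sketch: inspecting the WebGPU compute limits discussed above.
// Assumes an environment where navigator.gpu is available (e.g. a browser).
async function logComputeLimits(): Promise<void> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU is not supported");
  const device = await adapter.requestDevice();

  // Default 65,535: the per-dimension cap on dispatchWorkgroups(x, y, z).
  console.log(device.limits.maxComputeWorkgroupsPerDimension);

  // Default 256: the cap on the product of the @workgroup_size dimensions.
  console.log(device.limits.maxComputeInvocationsPerWorkgroup);

  // Per-axis @workgroup_size caps (defaults 256, 256, 64); their product
  // must still not exceed maxComputeInvocationsPerWorkgroup.
  console.log(
    device.limits.maxComputeWorkgroupSizeX,
    device.limits.maxComputeWorkgroupSizeY,
    device.limits.maxComputeWorkgroupSizeZ,
  );
}
```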
The bulk of the current runtime occurs during the SMVP stage, where we're launching 64 (x) × 2 (y) × 16 (z) workgroups, each with a `workgroup_size` of 256, i.e. 524,288 threads. Here's a comparison with a top-grade server Nvidia GPU (A10) for context:
There must be some internal queue in WebGPU, as you suggested, that executes those 524,288 threads in smaller batches, since the underlying GPU devices clearly cannot run that many threads concurrently.
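Spelling that arithmetic out as a quick sketch (the variable names are just for illustration):

```ts
// The SMVP dispatch: 64 x 2 x 16 workgroups, each with @workgroup_size(256).
const numWorkgroups = 64 * 2 * 16;                            // 2,048 workgroups
const invocationsPerWorkgroup = 256;                          // x*y*z of workgroup_size
const totalThreads = numWorkgroups * invocationsPerWorkgroup; // 524,288 threads
```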
Currently, we're doing:

```ts
await smvp_gpu(
  smvp_shader,
  s_num_x_workgroups,
  s_num_y_workgroups,
  s_num_z_workgroups,
  device,
  commandEncoder,
  num_subtasks,
  num_columns,
  input_size,
  chunk_size,
  all_csc_col_ptr_sb,
  point_x_sb,
  point_y_sb,
  all_csc_val_idxs_sb,
  bucket_sum_x_sb,
  bucket_sum_y_sb,
  bucket_sum_t_sb,
  bucket_sum_z_sb,
);
```
which takes X amount of time. Halving `s_num_x_workgroups` to `s_num_x_workgroups / 2` reduces the execution time by 500ms, so the number of threads being launched is clearly the factor that affects performance more than anything else. Running the smvp_gpu shader twice with `s_num_x_workgroups / 2` seems to take about as long as a single shader invocation with the full `s_num_x_workgroups`.
This preliminarily suggests that splitting up the execution of the SMVP / bucket reduction shader into multiple separate calls might actually yield similar performance.
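As a rough sketch, the split could look like the loop below. This reuses the `smvp_gpu` call from above; `half` is a hypothetical variable, and the shader would additionally need an x-offset parameter (which doesn't exist yet) so the second dispatch knows where to start:

```ts
// Hypothetical sketch: two half-sized dispatches instead of one.
// Assumes the variables from the call above are in scope.
const half = s_num_x_workgroups / 2;
for (let i = 0; i < 2; i++) {
  await smvp_gpu(
    smvp_shader,
    half,               // half the x-workgroups per dispatch
    s_num_y_workgroups,
    s_num_z_workgroups,
    device,
    commandEncoder,
    num_subtasks,
    num_columns,
    input_size,
    chunk_size,
    all_csc_col_ptr_sb,
    point_x_sb,
    point_y_sb,
    all_csc_val_idxs_sb,
    bucket_sum_x_sb,
    bucket_sum_y_sb,
    bucket_sum_t_sb,
    bucket_sum_z_sb,
    // an x-offset of i * half would be passed to the shader here (hypothetical)
  );
}
```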
With these tweaks, 2^16 MSM is now sub-1s on my machine around half of the time.
It seems that setting `workgroup_size` to 256 is the best strategy, because it reduces the number of workgroups that have to be dispatched.
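To illustrate the trade-off (the numbers and names here are only illustrative):

```ts
// For a fixed number of invocations, a larger workgroup_size means fewer
// workgroups need to be dispatched.
const totalInvocations = 524_288;               // the SMVP thread count above
const workgroupsAt256 = totalInvocations / 256; // 2,048 workgroups dispatched
const workgroupsAt64 = totalInvocations / 64;   // 8,192 workgroups dispatched
```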