Open raphaelthegreat opened 1 month ago
v_readlane_b32 actually reads from a two-dimensional array of VGPRs, indexed by the VGPR number and the lane index within the wavefront [0..63]. There was a proposal to emulate this through an external buffer, but I can't see how to synchronize the two halves of a wavefront when they execute separately on NVIDIA. It also raises the question of how many buffers should be allocated for shader execution and how to calculate that number.
I also had an idea earlier to execute all 64 lanes within a single shader invocation. This would require significant changes in the shader recompiler, and probably additional extensions to read 64 pixels simultaneously in a pixel shader. For a vertex shader it looks simpler: you just need to turn the incoming attribute into a read from a buffer.
I will also add that, if the registers used for readlane/writelane do not depend on other operations, then this is simple indexing in an array.
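To make the "simple indexing in an array" point concrete, here is a minimal CPU-side sketch in Python (not shader code). The `VgprFile` class and its method names are hypothetical, introduced only for illustration. It assumes the lane select wraps to 6 bits and that the first-lane read falls back to lane 0 when no lane is active.

```python
WAVE_SIZE = 64  # lanes per wavefront, as described above


class VgprFile:
    """Hypothetical model: VGPRs as a 2D array indexed by [register][lane]."""

    def __init__(self, num_vgprs):
        self.regs = [[0] * WAVE_SIZE for _ in range(num_vgprs)]

    def read_lane(self, vgpr, lane):
        # Models V_READLANE_B32: fetch one lane's value of one VGPR.
        return self.regs[vgpr][lane % WAVE_SIZE]

    def write_lane(self, vgpr, lane, value):
        # Models V_WRITELANE_B32: store a 32-bit value into one lane.
        self.regs[vgpr][lane % WAVE_SIZE] = value & 0xFFFFFFFF

    def read_first_lane(self, vgpr, exec_mask):
        # Models V_READFIRSTLANE_B32: use the lowest active lane in the
        # 64-bit execution mask; assumed to fall back to lane 0 if empty.
        if exec_mask == 0:
            lane = 0
        else:
            lane = (exec_mask & -exec_mask).bit_length() - 1
        return self.regs[vgpr][lane]
```

This only holds while the whole wavefront's registers live in one place, which is exactly what is lost when a wavefront is split across two 32-wide subgroups.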
RDR basically needs V_READFIRSTLANE_B32, V_READLANE_B32 and V_WRITELANE_B32.
The instruction V_READLANE_B32 is basically a hardware implementation of subgroupBroadcast. So when the warp size is 32, we only need to concern ourselves with pairs of subgroups. We initialize a shared memory region where each slot lets one pair of subgroups communicate with each other. The implementation I had in mind looks like the following:
shared uint data[NumSubgroups >> 1];
const uint warp_id = lane_id >> 5;
if ((gl_SubgroupID & 1u) == warp_id) {
data[gl_SubgroupID >> 1] = subgroupBroadcast(value, lane_id - 32 * warp_id);
}
barrier();
uint result = data[gl_SubgroupID >> 1];
The barrier() call will synchronize all subgroups in the current workgroup.
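The exchange above can be checked on the CPU with a small Python model. This is a sketch under assumptions: `emulate_readlane` is a hypothetical name, the workgroup is assumed to hold an even number of 32-wide subgroups, and each pair has one uniform `lane_id` in [0..63]. The two loops stand in for the code before and after barrier().

```python
SUBGROUP_SIZE = 32  # assumed NVIDIA subgroup (warp) width


def emulate_readlane(workgroup_values, lane_ids):
    """Model the shared-memory exchange for a 64-lane read-lane.

    workgroup_values: one value per invocation, subgroup after subgroup.
    lane_ids: one source lane in [0..63] per pair of subgroups.
    Returns the value every invocation observes after the barrier.
    """
    num_subgroups = len(workgroup_values) // SUBGROUP_SIZE
    shared = [0] * (num_subgroups >> 1)  # one slot per subgroup pair

    # Phase 1 (before barrier): only the subgroup holding the source lane
    # broadcasts it and publishes the value to its pair's slot.
    for subgroup_id in range(num_subgroups):
        pair = subgroup_id >> 1
        lane_id = lane_ids[pair]
        warp_id = lane_id >> 5  # which half of the wavefront owns the lane
        if (subgroup_id & 1) == warp_id:
            base = subgroup_id * SUBGROUP_SIZE
            shared[pair] = workgroup_values[base + (lane_id - 32 * warp_id)]

    # Phase 2 (after barrier): every invocation reads its pair's slot.
    return [shared[(i // SUBGROUP_SIZE) >> 1]
            for i in range(len(workgroup_values))]
```

For example, with 128 invocations holding the values 0..127 and `lane_ids = [40, 3]`, the first pair observes the value from lane 40 of its wavefront and the second pair observes the value from lane 3 of its own.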
V_READFIRSTLANE_B32 is basically a hardware implementation of subgroupBroadcastFirst. This is simpler than the case above, as we know we always fetch data from even subgroup IDs:
shared uint data[NumSubgroups >> 1];
if ((gl_SubgroupID & 1u) == 0u) {
data[gl_SubgroupID >> 1] = subgroupBroadcastFirst(value);
}
barrier();
uint result = data[gl_SubgroupID >> 1];
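A similar CPU-side Python sketch for the first-lane case follows. The name `emulate_readfirstlane` and the `active` list are hypothetical. It relies on the same assumption as the snippet above, namely that the even subgroup of each pair has at least one active invocation; the sketch raises an error otherwise rather than guessing a fallback.

```python
SUBGROUP_SIZE = 32  # assumed NVIDIA subgroup (warp) width


def emulate_readfirstlane(workgroup_values, active):
    """Model the even-subgroup broadcast-first exchange.

    workgroup_values: one value per invocation, subgroup after subgroup.
    active: one bool per invocation, True if that invocation is active.
    Returns the value every invocation observes after the barrier.
    """
    num_subgroups = len(workgroup_values) // SUBGROUP_SIZE
    shared = [0] * (num_subgroups >> 1)  # one slot per subgroup pair

    # Phase 1 (before barrier): only even subgroups publish a value,
    # taken from their first active invocation.
    for subgroup_id in range(0, num_subgroups, 2):
        base = subgroup_id * SUBGROUP_SIZE
        first = next((i for i in range(SUBGROUP_SIZE) if active[base + i]),
                     None)
        if first is None:
            raise ValueError("even subgroup has no active invocation")
        shared[subgroup_id >> 1] = workgroup_values[base + first]

    # Phase 2 (after barrier): every invocation reads its pair's slot.
    return [shared[(i // SUBGROUP_SIZE) >> 1]
            for i in range(len(workgroup_values))]
```

The explicit error marks the one case this scheme does not cover: a wavefront whose first active lane sits in the odd subgroup.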
Ok, can we calculate the number of subgroup pairs to accurately calculate the required number of buffers for each parallel execution case?
Yes, when compiling the shader we know the local workgroup size, so we always know how many subgroups it will have:
NumSubgroups = (local_size_x * local_size_y * local_size_z) >> 5 for NVIDIA
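As a rough Python sketch of that sizing calculation, the helpers below compute the subgroup count and the number of shared slots from the local size. The function names are hypothetical. Unlike the plain `>> 5`, they round up, which is an assumption on my side for workgroup sizes that are not a multiple of the subgroup width; the two agree whenever the size is a multiple of 64.

```python
def num_subgroups(local_size_x, local_size_y, local_size_z, subgroup_size=32):
    """Subgroups per workgroup, rounded up (assumption: partial subgroups count)."""
    invocations = local_size_x * local_size_y * local_size_z
    return (invocations + subgroup_size - 1) // subgroup_size


def num_subgroup_pairs(local_size_x, local_size_y, local_size_z,
                       subgroup_size=32):
    """Shared-memory slots needed: one per pair of subgroups, rounded up."""
    count = num_subgroups(local_size_x, local_size_y, local_size_z,
                          subgroup_size)
    return (count + 1) >> 1
```

For a local size of 8x8x4 (256 invocations) this gives 8 subgroups and 4 slots, matching `256 >> 5` and `8 >> 1`.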
Hey, good time to update it ? :)
Collection of backend changes that need proper implementation before RDR can boot on main