Needed changes for Red Dead Redemption

raphaelthegreat commented 1 month ago

Collection of backend changes that need proper implementation before RDR can boot on main

[ ] Address control flow issues when VCC is used as EXEC
[ ] Fix texture cache image overlap assertion (likely a missing depth-defining draw)
[ ] Proper implementation of V_READLANE_B32/V_WRITELANE_B32 for both AMD and NVIDIA
[ ] Handling of V_CMP_CLASS_F32 with SGPR as mask (needs handling in constant propagation pass)
[ ] Investigate RGBA swizzling issues making the characters blue

red-prig commented 1 month ago

v_readlane_b32 actually reads from a two-dimensional array of VGPRs, indexed by the VGPR number and the lane index within the wavefront [0..63]. There was a proposal to emulate this through an external buffer, but I can't imagine how to ensure that the execution of the 2 wavefronts is synchronized in the case of NVIDIA. The question also arises of how many buffers should be allocated for shader execution and how to calculate that number. Earlier I also had the idea of executing all 64 lanes within a single shader call. This would require significant changes in the shader recompiler, and probably additional extensions to read 64 pixels simultaneously in a pixel shader. For a vertex shader it looks simpler: you just need to turn the incoming attribute into a read from a buffer.

red-prig commented 1 month ago

I will also add that, if the registers used for readlane/writelane do not depend on other operations, then this is simple indexing in an array.
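
To make the "two-dimensional array" view concrete, here is a minimal CPU-side sketch in C++. It is only an illustration, not shadPS4 code: the `Wavefront` struct, the `ReadLane`/`WriteLane` helpers, the register count, and the masking of the lane index to the wave size are all assumptions made for this example.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of a wave64 register file: vgpr[reg][lane].
constexpr std::uint32_t kWaveSize = 64;
constexpr std::uint32_t kNumVgprs = 256; // illustrative register count

struct Wavefront {
    // One 32-bit value per (register, lane) pair, zero-initialized.
    std::array<std::array<std::uint32_t, kWaveSize>, kNumVgprs> vgpr{};
};

// V_READLANE_B32 viewed as plain indexing: scalar = vgpr[reg][lane].
// Masking the lane keeps the index in range (an assumption of this sketch).
inline std::uint32_t ReadLane(const Wavefront& wave, std::uint32_t reg, std::uint32_t lane) {
    return wave.vgpr[reg][lane & (kWaveSize - 1)];
}

// V_WRITELANE_B32 viewed as plain indexing: vgpr[reg][lane] = scalar.
inline void WriteLane(Wavefront& wave, std::uint32_t reg, std::uint32_t lane,
                      std::uint32_t value) {
    wave.vgpr[reg][lane & (kWaveSize - 1)] = value;
}
```

In this model a write to register 3, lane 40 is visible only through a read of register 3, lane 40. The difficulty discussed in this thread is that on a 32-wide host the two halves of the lane dimension live in different subgroups.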

raphaelthegreat commented 1 month ago

RDR basically needs V_READFIRSTLANE_B32, V_READLANE_B32 and V_WRITELANE_B32.

The instruction V_READLANE_B32 is basically a hardware implementation of subgroupBroadcast. So when the warp size is 32, we only need to concern ourselves with pairs of subgroups. We initialize a shared memory region where each slot lets one pair of subgroups communicate with each other. The implementation I had in mind looks like the following:

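shared uint data[NumSubgroups >> 1];
// warp_id selects which 32-wide half of the 64-lane wavefront owns lane_id.
const uint warp_id = lane_id >> 5;
// Parentheses are required: == binds tighter than & in GLSL.
if ((gl_SubgroupID & 1) == warp_id) {
    data[gl_SubgroupID >> 1] = subgroupBroadcast(value, lane_id - 32 * warp_id);
}
barrier();
uint result = data[gl_SubgroupID >> 1];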

The barrier() call will synchronize all subgroups in the current workgroup.
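
As a sanity check of the idea, the snippet above can be replayed on the CPU. The following C++ sketch is a sequential simulation, not shader code: `EmulateReadLane` and `kSubgroupSize` are hypothetical names, the workgroup size is assumed to be a multiple of 64, and subgroups 2k and 2k+1 are assumed to hold the two halves of one guest wavefront. The two loops stand in for the code before and after barrier().

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint32_t kSubgroupSize = 32; // assumed host subgroup (warp) width

// values[i] is the per-invocation `value`; subgroup s owns values[s*32 .. s*32+31].
// lane_id is the guest wave64 lane to read, in [0, 63].
// Returns the `result` every invocation would observe.
std::vector<std::uint32_t> EmulateReadLane(const std::vector<std::uint32_t>& values,
                                           std::uint32_t lane_id) {
    const std::uint32_t num_subgroups =
        static_cast<std::uint32_t>(values.size()) / kSubgroupSize;
    std::vector<std::uint32_t> shared(num_subgroups >> 1); // data[NumSubgroups >> 1]
    const std::uint32_t warp_id = lane_id >> 5;            // half that owns lane_id

    // Before the barrier: only the subgroup that owns the lane publishes it.
    for (std::uint32_t sg = 0; sg < num_subgroups; ++sg) {
        if ((sg & 1) == warp_id) {
            const std::uint32_t local_lane = lane_id - kSubgroupSize * warp_id;
            // Stands in for subgroupBroadcast(value, local_lane).
            shared[sg >> 1] = values[sg * kSubgroupSize + local_lane];
        }
    }

    // After the barrier: every invocation reads the slot of its subgroup pair.
    std::vector<std::uint32_t> result(values.size());
    for (std::uint32_t i = 0; i < result.size(); ++i) {
        result[i] = shared[(i / kSubgroupSize) >> 1];
    }
    return result;
}
```

With 128 invocations (two guest wavefronts) and lane_id = 40, every invocation of the first pair sees the value of invocation 40 and every invocation of the second pair sees the value of invocation 104, which matches what a 64-wide read of lane 40 would return per wavefront.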

V_READFIRSTLANE_B32 is basically a hardware implementation of subgroupBroadcastFirst. This is simpler than the above case, as we know we always fetch data from even subgroup IDs:

shared uint data[NumSubgroups >> 1];
// Parentheses are required: == binds tighter than & in GLSL.
if ((gl_SubgroupID & 1) == 0) {
    data[gl_SubgroupID >> 1] = subgroupBroadcastFirst(value);
}
barrier();
uint result = data[gl_SubgroupID >> 1];
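
For reference, the 64-wide behaviour being emulated can be modelled on the CPU as below. This is a sketch with hypothetical helper names (`FirstActiveLane`, `ReadFirstLane`, `FirstLaneInEvenSubgroup`). It assumes the usual GCN semantics, where the source lane is the lowest set bit of the 64-bit EXEC mask and lane 0 is used when EXEC is zero. Under that assumption, the "always the even subgroup" shortcut holds whenever at least one of the lower 32 lanes is active; the last helper makes that condition explicit.

```cpp
#include <array>
#include <cstdint>

// Index of the lowest set bit of EXEC; lane 0 when no lane is active.
inline std::uint32_t FirstActiveLane(std::uint64_t exec) {
    for (std::uint32_t lane = 0; lane < 64; ++lane) {
        if ((exec >> lane) & 1u) {
            return lane;
        }
    }
    return 0;
}

// Model of V_READFIRSTLANE_B32 for a single VGPR of a 64-lane wavefront.
inline std::uint32_t ReadFirstLane(const std::array<std::uint32_t, 64>& vgpr,
                                   std::uint64_t exec) {
    return vgpr[FirstActiveLane(exec)];
}

// True when the source lane falls in the lower half, i.e. in the even
// subgroup of a 32-wide pair.
inline bool FirstLaneInEvenSubgroup(std::uint64_t exec) {
    return FirstActiveLane(exec) < 32;
}
```

If a shader can reach this instruction with only upper-half lanes active, the odd subgroup would have to publish the value instead, similar to the V_READLANE_B32 case.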

red-prig commented 1 month ago

Ok, can we calculate the number of subgroup pairs to accurately calculate the required number of buffers for each parallel execution case?

raphaelthegreat commented 1 month ago

Yes, when compiling the shader we know the local workgroup size, so we always know how many subgroups it will have.

raphaelthegreat commented 1 month ago

NumSubgroups = (local_size_x * local_size_y * local_size_z) >> 5 for NVIDIA
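
The sizing rule can be written down as a small compile-time helper. This is a sketch in C++ with hypothetical function names. It assumes a 32-wide subgroup and a workgroup size that is a multiple of 32, since the shift rounds down.

```cpp
#include <cstdint>

// Number of 32-wide subgroups in a workgroup (assumes the product is a multiple of 32).
constexpr std::uint32_t NumSubgroups(std::uint32_t local_size_x, std::uint32_t local_size_y,
                                     std::uint32_t local_size_z) {
    return (local_size_x * local_size_y * local_size_z) >> 5;
}

// Number of shared-memory slots: one per pair of subgroups, i.e. one per
// emulated 64-lane wavefront.
constexpr std::uint32_t NumSharedSlots(std::uint32_t local_size_x, std::uint32_t local_size_y,
                                       std::uint32_t local_size_z) {
    return NumSubgroups(local_size_x, local_size_y, local_size_z) >> 1;
}

static_assert(NumSubgroups(64, 1, 1) == 2, "one wave64 maps to two subgroups");
static_assert(NumSharedSlots(8, 8, 4) == 4, "256 invocations need four slots");
```

Because the local size is known when the shader is compiled, the recompiler could evaluate this once and emit the shared array with a constant size.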

Hermiten commented 1 week ago

Hey, is this a good time to update it? :)

shadps4-emu / shadPS4

Needed changes for Red Dead Redemption #331