Epliz opened this issue 2 years ago (status: Open)
@llvm/issue-subscribers-backend-amdgpu
For the second point, I guess the compiler can't do much, as multiplying the 32-bit index by the type size might require a bigger type... Would it be possible to have intrinsics that give access to the base + offset versions of the load/store instructions?
If you have the opportunity, please ask the hardware engineers to add load/store instructions with addressing modes that accept 32/64-bit signed/unsigned offsets with strides of 1, 2, 4, 8.
For anyone who might read this: it is possible to get the compiler to emit the "nice" load/store instructions (SGPR-pair base + VGPR offset) with code like the following:
template<typename T>
__device__
static T read(T const* a, uint32_t i) {
const uint8_t* a_ = (const uint8_t*) a;
// keeping byte offset in 32 bit unsigned
// makes it possible to use SGPR-pair + VGPR offset load instruction
uint32_t byte_off = i * sizeof(T);
return *((const T*) (a_ + byte_off));
}
template<typename T>
__device__
static void write(T * a, uint32_t i, T val) {
uint8_t* a_ = (uint8_t*) a;
// keeping byte offset in 32 bit unsigned
// makes it possible to use SGPR-pair + VGPR offset store instruction
uint32_t byte_off = i * sizeof(T);
*((T*) (a_ + byte_off)) = val;
}
__global__
void better_indexing(float const* a, float const* b, float const* c, float* d, size_t N, size_t C, uint32_t offset) {
size_t blockStart = (blockIdx.x * blockDim.x) * C; // warp uniform value, will go in SGPR
if (blockStart >= N) {
return;
}
// C * warpSize assumed to fit in uint32_t
uint32_t blockMax = (uint32_t) std::min(C * warpSize, N - blockStart);
// align pointers
const float* a_ = &a[blockStart]; // warp uniform value, will go in SGPR pair
const float* b_ = &b[blockStart]; // warp uniform value, will go in SGPR pair
const float* c_ = &c[blockStart]; // warp uniform value, will go in SGPR pair
float* d_ = &d[blockStart]; // warp uniform value, will go in SGPR pair
uint32_t subOffset = threadIdx.x;
for (uint32_t i = 0; i < C; i++, subOffset += warpSize) {
if (subOffset < blockMax) {
float r = read(a_, subOffset) + read(b_, subOffset) + read(c_, subOffset);
write(d_, subOffset, r);
}
}
}
The loop part gives the following nice assembly:
.LBB1_3: ; in Loop: Header=BB1_4 Depth=1
s_or_b64 exec, exec, s[12:13]
v_mov_b32_e32 v2, s2
v_mov_b32_e32 v3, s3
v_cmp_lt_u64_e32 vcc, s[10:11], v[2:3]
v_add_u32_e32 v0, 64, v0
s_add_i32 s10, s10, 1
v_add_u32_e32 v1, 0x100, v1
s_and_b64 vcc, exec, vcc
s_cbranch_vccz .LBB1_6
.LBB1_4: ; =>This Inner Loop Header: Depth=1
v_cmp_gt_u32_e32 vcc, s14, v0
s_and_saveexec_b64 s[12:13], vcc
s_cbranch_execz .LBB1_3
; %bb.5: ; in Loop: Header=BB1_4 Depth=1
global_load_dword v2, v1, s[0:1]
global_load_dword v3, v1, s[4:5]
global_load_dword v4, v1, s[6:7]
s_waitcnt vmcnt(1)
v_add_f32_e32 v2, v2, v3
s_waitcnt vmcnt(0)
v_add_f32_e32 v2, v2, v4
global_store_dword v1, v2, s[8:9]
s_branch .LBB1_3
@EugeneZelenko may I take this one?
@snikitav: It would be a good idea to ask @arsenm, who is the most active developer of the backend.
Would it be possible to have intrinsics to have access to the base + offset versions of the load/store instructions?
No, that's a maintenance headache for both the backend and user code. The addressing modes change all the time. We do try to make use of these, but the addressing modes are less useful than you would hope, since we need to prove lack of overflow to reassociate the expression into the forms they need.
Hi,
I have observed some suboptimal generated assembly for what seem like pretty common code patterns with Clang 14.0.0 from ROCm 5.0.2.
For this example kernel code:
In the generated assembly (at -O3), we can observe two issues with the code:
1) several VGPRs are used to index the same array several times, while the same VGPR pair could simply be reused for the incremented address;
2) incrementing the index with globalThreadId += offset; actually causes as many VGPR additions as there are accessed arrays, while the variant global_load_dword vdest, voffset, saddr could be used for the loads since the offset is a uint32_t, which would mean incrementing only the offset once for all arrays.
I think that both issues are quite a problem.
I hope that this can get fixed in a future release. If it has already been addressed, let me know and please disregard this.
Best regards, Epliz.