Closed Jasper-Bekkers closed 3 years ago
It looks like this issue is significantly worse than expected, since non-templated loads exhibit the same bad behavior:
http://shader-playground.timjones.io/1ff2f9b80ad8cb73d7cc64f60edb2804 http://shader-playground.timjones.io/88a55360960ebd185281324bb19511d5 http://shader-playground.timjones.io/91e117daf519511a033a9c3a8daf457d
The expected output would be to have one OpTypeVector of 4 elements, and one OpLoad on that.
As an example, see the following RDNA ISA as output by the RGA tool:
; -------- Disassembly --------------------
shader main
asic(GFX10)
type(CS)
sgpr_count(6)
vgpr_count(8)
wave_size(64)
s_inst_prefetch 0x0003 // 000000000000: BFA00003
s_getpc_b64 s[0:1] // 000000000004: BE801F80
s_mov_b32 s0, s2 // 000000000008: BE800302
s_load_dwordx4 s[4:7], s[0:1], null // 00000000000C: F4080100 FA000000
s_and_b32 s0, s3, lit(0x00fffffc) // 000000000014: 8700FF03 00FFFFFC
v_mov_b32 v0, s0 // 00000000001C: 7E000200
s_waitcnt lgkmcnt(0) // 000000000020: BF8CC07F
s_clause 0x0003 // 000000000024: BFA10003
buffer_load_dword v1, v0, s[4:7], 0 offen // 000000000028: E0301000 80010100
buffer_load_dword v2, v0, s[4:7], 0 offen offset:4 // 000000000030: E0301004 80010200
buffer_load_dword v3, v0, s[4:7], 0 offen offset:8 // 000000000038: E0301008 80010300
buffer_load_dword v4, v0, s[4:7], 0 offen offset:12 // 000000000040: E030100C 80010400
s_waitcnt vmcnt(3) // 000000000048: BF8C3F73
v_cvt_u32_f32 v1, v1 // 00000000004C: 7E020F01
s_waitcnt vmcnt(2) // 000000000050: BF8C3F72
v_cvt_u32_f32 v2, v2 // 000000000054: 7E040F02
s_waitcnt vmcnt(1) // 000000000058: BF8C3F71
v_cvt_u32_f32 v3, v3 // 00000000005C: 7E060F03
s_waitcnt vmcnt(0) // 000000000060: BF8C3F70
v_cvt_u32_f32 v4, v4 // 000000000064: 7E080F04
buffer_store_dword v1, v0, s[4:7], 0 offen glc // 000000000068: E0705000 80010100
buffer_store_dword v2, v0, s[4:7], 0 offen offset:4 glc // 000000000070: E0705004 80010200
buffer_store_dword v3, v0, s[4:7], 0 offen offset:8 glc // 000000000078: E0705008 80010300
buffer_store_dword v4, v0, s[4:7], 0 offen offset:12 glc // 000000000080: E070500C 80010400
s_endpgm // 000000000088: BF810000
This also emits 4 buffer_load_dword and 4 buffer_store_dword instructions, whereas one buffer_load_dwordx4 and one buffer_store_dwordx4 would do. Notice also that even though the compiler correctly detects that the loads can be in a single clause, it still emits 4 s_waitcnt ops. Arguably AMD should have a LoadStoreVectorizer pass similar to what's found in LLVM; however, in practice it seems that NVIDIA also doesn't run such a pass and relies on the source compiler (DXC in this case) to do the right thing.
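To make the missing optimization concrete, here is a minimal sketch of what a load/store vectorizer does with the four loads in the disassembly above. It is a toy model and not AMD's or LLVM's implementation: the `Access` tuple and the `coalesce_dword_accesses` helper are hypothetical names, and the sketch ignores the alignment, aliasing and load/store ordering checks a real pass would need.

```python
from typing import List, Tuple

# Hypothetical model: one access = (address register, byte offset), 4 bytes wide.
Access = Tuple[str, int]


def coalesce_dword_accesses(
    accesses: List[Access], max_dwords: int = 4
) -> List[Tuple[str, int, int]]:
    """Merge runs of adjacent dword accesses into (base, offset, dword_count)."""
    merged: List[Tuple[str, int, int]] = []
    for base, offset in sorted(accesses):
        if merged:
            last_base, last_offset, last_count = merged[-1]
            # Adjacent means: same base and starting right after the previous run.
            contiguous = (
                base == last_base and offset == last_offset + 4 * last_count
            )
            if contiguous and last_count < max_dwords:
                merged[-1] = (last_base, last_offset, last_count + 1)
                continue
        merged.append((base, offset, 1))
    return merged


# The four buffer_load_dword instructions from the disassembly: offsets 0, 4, 8, 12.
loads = [("v0", 0), ("v0", 4), ("v0", 8), ("v0", 12)]
print(coalesce_dword_accesses(loads))  # [('v0', 0, 4)]
```

The single `('v0', 0, 4)` entry corresponds to one four-dword access in place of four single-dword ones.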
@Jasper-Bekkers Thank you for reporting this issue. I will take a look and get back to you.
@jaebaek Thanks! Could this possibly have any relation to https://github.com/microsoft/DirectXShaderCompiler/issues/3370?
This is a limitation of SPIR-V. If the variable is typed to be a runtime array of uints (as it is here), then only a single uint can be loaded at a time. There is a potential optimization of representing the variable as a runtime array of uint4s (or an aggregate of larger values), but that complicates the logic to calculate the addresses.
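The extra address logic can be sketched as follows. This is an illustrative CPU-side model, not DXC code: the function names are made up for the example, and it assumes byte offsets are multiples of 4.

```python
from typing import List, Tuple


def uint_array_indices(byte_offset: int, dwords: int = 4) -> List[int]:
    """Buffer typed as a runtime array of uint: one index per 32-bit load."""
    first = byte_offset // 4
    return [first + i for i in range(dwords)]


def uint4_array_indices(byte_offset: int, dwords: int = 4) -> List[Tuple[int, int]]:
    """Buffer typed as a runtime array of uint4: (element, component) per dword."""
    first = byte_offset // 4
    return [((first + i) // 4, (first + i) % 4) for i in range(dwords)]


def vector_loads_needed(byte_offset: int, dwords: int = 4) -> int:
    """Number of distinct uint4 elements that have to be loaded."""
    return len({element for element, _ in uint4_array_indices(byte_offset, dwords)})


print(uint_array_indices(20))    # [5, 6, 7, 8]
print(uint4_array_indices(16))   # [(1, 0), (1, 1), (1, 2), (1, 3)]
print(uint4_array_indices(20))   # [(1, 1), (1, 2), (1, 3), (2, 0)]
print(vector_loads_needed(20))   # 2
```

A 16-byte-aligned offset maps onto exactly one uint4 element, but an offset such as 20 straddles two elements, so the generated code would need two vector loads plus component selection unless the offset is known to be aligned at compile time.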
I agree with @alan-baker. It works correctly but inefficiently, which is a limitation of SPIR-V.
Since the given ByteAddressBuffer is a buffer of "bytes" (it generates a uint buffer in SPIR-V), converting it into an arbitrary type using the templated load definitely needs a type cast for each element. Currently, we do not have SPIR-V instructions for doing it efficiently.
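As a rough illustration of the per-element cast, the sketch below models a four-component float load from a uint buffer: each component is fetched as its own 32-bit word and then reinterpreted as a float. `load_float4` is a hypothetical helper written for this example, not the code DXC generates, and the bit patterns are chosen only for the demonstration.

```python
import struct
from typing import List


def load_float4(words: List[int], byte_offset: int) -> List[float]:
    """Load four floats from a list of 32-bit words, one word at a time."""
    first = byte_offset // 4
    result = []
    for i in range(4):
        raw = words[first + i]  # one 32-bit load per component
        # Bitcast uint -> float: same bits, different interpretation.
        value = struct.unpack("<f", struct.pack("<I", raw))[0]
        result.append(value)
    return result


# IEEE 754 bit patterns for 1.0, 2.0, 3.0 and 4.0.
words = [0x3F800000, 0x40000000, 0x40400000, 0x40800000]
print(load_float4(words, 0))  # [1.0, 2.0, 3.0, 4.0]
```

The loop body runs four times, mirroring the four separate loads in the generated code; a vector-typed load would fetch all four words at once and cast the whole vector in one step.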
Currently, we do not have SPIR-V instructions for doing it efficiently.
Might be better to close this issue then, and follow up directly within Khronos instead. What do you think?
Yes, we can close it and open another one when we have some solutions for this.
Output:
Compared to:
Run with: dxc -spirv -Tps_6_5 -EPSMain test.hlsl