Hi @preda, do you think this patch may be interesting?
I have tested it and it works, but I think that performance suffers from it. I didn't do extensive tests, only a quick run.
I'm taking a look now.
The change you propose reduces performance (by about 2%) on ROCm 5.5.1, Radeon VII, on exponent 113203711, which uses FFT 1K:12:256. What's more, without the change ("as is") there is no problem observed on ROCm/Radeon VII.
Here is what I observed looking at the generated ISA. For a kernel declared with group-size=64 (which thus does not need s_barrier on R7), without the change I see:
ds_write_b128 v102, v[13:16]
ds_write_b128 v102, v[17:20] offset:16
; wave barrier
ds_read2st64_b64 v[29:32], v101 offset1:1
ds_read2st64_b64 v[33:36], v101 offset0:2 offset1:3
while with the change I see:
ds_write_b128 v102, v[13:16]
ds_write_b128 v102, v[17:20] offset:16
s_waitcnt lgkmcnt(0)
; wave barrier
s_waitcnt lgkmcnt(0)
ds_read2st64_b64 v[29:32], v101 offset1:1
ds_read2st64_b64 v[33:36], v101 offset0:2 offset1:3
As you can see, some redundant s_waitcnt lgkmcnt(0) instructions are generated. This looks like a compiler shortcoming.
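For context, here is a minimal, hypothetical OpenCL kernel (not gpuowl's actual code) showing the shape of the pattern above: with a required work-group size of 64, the whole group fits in one wave64 wavefront on Radeon VII, so the barrier between the LDS writes and reads needs no s_barrier, only wave-level ordering:

kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void shuffle64(global const double *in, global double *out) {
  // Hypothetical kernel, for illustration only.
  local double lds[64];
  uint lid = get_local_id(0);

  lds[lid] = in[get_global_id(0)];        // compiles to ds_write*
  barrier(CLK_LOCAL_MEM_FENCE);           // "with the change"; the existing code passes 0 here
  out[get_global_id(0)] = lds[63 - lid];  // compiles to ds_read*
}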
We found a problem with gpuowl while testing a ROCm release.
Please provide repro for the problem you found. I'd like to reproduce it myself. If I can confirm it, I'll be looking for a fix, including along your proposed change.
Hello!
I run gpuowl as follows: ./gpuowl -prp 84682337
(TBH I don't know why these parameters are used, I'm not familiar with gpuowl).
I'm able to reproduce the issue on a gfx1030.
The issue appears after "[AMDGPU] Omit unnecessary waitcnt before barriers" landed in LLVM, which removes the s_waitcnt inserted before s_barrier for the gfx90a, gfx1010, gfx1030 and gfx940 targets.
I'm currently trying to figure out how to get a ROCm beta release containing that commit that I can share.
In the meantime, I'll show the asm I've got.
Before the LLVM patch, I get the following assembly:
ds_write2_b64 v113, v[2:3], v[4:5] offset1:16
ds_write2_b64 v113, v[20:21], v[22:23] offset0:32 offset1:48
s_waitcnt lgkmcnt(0)
s_barrier
ds_read2st64_b64 v[0:3], v110 offset1:1
ds_read2st64_b64 v[4:7], v110 offset0:2 offset1:3
After the LLVM patch, the s_waitcnt is gone (as expected):
ds_write2_b64 v113, v[2:3], v[4:5] offset1:16
ds_write2_b64 v113, v[20:21], v[22:23] offset0:32 offset1:48
s_barrier
ds_read2st64_b64 v[0:3], v110 offset1:1
ds_read2st64_b64 v[4:7], v110 offset0:2 offset1:3
But when executing gpuowl, I get the following output:
20230628 15:35:34 9ff6ee0b0512eb0c 84682337 OpenCL compilation in 2.42 s
20230628 15:35:34 9ff6ee0b0512eb0c 84682337 PRP starting from beginning
20230628 15:35:34 9ff6ee0b0512eb0c 84682337 EE 0 on-load: 0000000000000000 vs. 0000000000000003
20230628 15:35:34 9ff6ee0b0512eb0c 84682337 PRP starting from beginning
20230628 15:35:35 9ff6ee0b0512eb0c 84682337 EE 0 on-load: 0000000000000000 vs. 0000000000000003
20230628 15:35:35 9ff6ee0b0512eb0c Exiting because "error on load"
20230628 15:35:35 9ff6ee0b0512eb0c Bye
With this PR, I get the following assembly:
ds_write2_b64 v113, v[2:3], v[4:5] offset1:16
ds_write2_b64 v113, v[20:21], v[22:23] offset0:32 offset1:48
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
s_barrier
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
buffer_gl0_inv
ds_read2st64_b64 v[0:3], v110 offset1:1
ds_read2st64_b64 v[4:7], v110 offset0:2 offset1:3
I agree that the s_waitcnt sandwich looks redundant. I'm currently trying to figure out whether we can safely remove some of those waits in the compiler.
However, I still think that this patch is relevant. After the different work-items queue their writes to the local data share and synchronize at the barrier, there is currently nothing ensuring that the changes written by one work-item are visible to the other work-items in the work-group.
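To make that concrete, here is a minimal, hypothetical example (not gpuowl's actual code, and assuming a work-group size of 256): each work-item writes its value to LDS, and after the barrier an even work-item reads its odd neighbour's slot. Per the OpenCL spec, only the flags passed to barrier() establish the memory fence, so barrier(0) synchronizes the work-items but does not by itself guarantee that the LDS writes are visible across the group:

kernel void sum_pairs(global const float *in, global float *out) {
  // Hypothetical kernel, for illustration only; assumes 256 work-items per group.
  local float tmp[256];
  uint lid = get_local_id(0);

  tmp[lid] = in[get_global_id(0)];
  barrier(CLK_LOCAL_MEM_FENCE);  // with barrier(0), the read below may not see the neighbour's write
  if ((lid & 1) == 0) {
    out[get_global_id(0) / 2] = tmp[lid] + tmp[lid + 1];
  }
}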
I'll try to get back soon with the rocm-release candidate containing the patch that triggers the issue.
Thanks again for your time!
Thanks! I'll apply the fix.
@preda ping :) Was there any update on this issue?
I've created a patch for LLVM to fix the waitcnt issue you mentioned. I hope it will make it into LLVM soon.
Thank you!
Hello,
We found a problem with gpuowl while testing a ROCm release.
It seems to me that barrier(0) is being used to synchronize the work-items in a work-group after they write to shared memory. But when the argument is set to 0, it is not guaranteed that the changes made to shared memory are visible to all work-items in the work-group.
Instead, we propose using CLK_LOCAL_MEM_FENCE:
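A sketch of the kind of change this amounts to (the exact call sites in gpuowl's kernels may differ):

// Before: work-items are synchronized, but no memory-fence flag is passed.
barrier(0);

// After: the barrier also fences local memory, so each work-item's LDS
// writes are visible to the rest of the work-group before anyone reads them.
barrier(CLK_LOCAL_MEM_FENCE);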
Don't hesitate to raise any issues with the patch I'm proposing; I started using OpenCL quite recently.
Best!