preda / gpuowl

GPU Mersenne primality test.
GNU General Public License v3.0

Use barrier(CLK_LOCAL_MEM_FENCE) instead of barrier(0) #270

Closed: jmmartinez closed this 12 months ago

jmmartinez commented 1 year ago

Hello,

We found a problem when using gpuowl when testing a ROCm release.

It seems to me that barrier(0) is being used to synchronize the work-items in a work-group after they write to shared memory.

But when the argument is 0, there is no guarantee that the changes made to shared memory are visible to all work-items in the work-group.

Instead, we propose using CLK_LOCAL_MEM_FENCE:

CLK_LOCAL_MEM_FENCE - ensure that all local memory accesses become visible to all work-items in the work-group. Note that the value of scope is ignored as the memory scope is always memory_scope_work_group.
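
For illustration, here is a minimal sketch of the pattern in question (the kernel and buffer names are invented, not the actual gpuowl code); only the barrier argument changes:

// Hypothetical kernel, not gpuowl source; names are made up.
kernel void reverse_in_group(global float *out, const global float *in,
                             local float *lds) {
  uint i = get_local_id(0);
  uint n = get_local_size(0);

  lds[i] = in[get_global_id(0)];     // each work-item writes its slot of local memory

  // barrier(0): only an execution rendezvous, with no guarantee that the
  // writes above are visible to the other work-items.
  // barrier(CLK_LOCAL_MEM_FENCE): the same rendezvous plus a local-memory
  // fence, so the read below sees every work-item's write.
  barrier(CLK_LOCAL_MEM_FENCE);

  out[get_global_id(0)] = lds[n - 1 - i];  // read a slot written by another work-item
}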

Don't hesitate to point out any issues in the patch I'm proposing; I only started using OpenCL quite recently.

Best!

jmmartinez commented 1 year ago

Hi @preda, do you think this patch might be of interest?

selroc commented 1 year ago

Hi @preda, do you think this patch might be of interest?

I have tested it and it works, but I think that performance suffers from it. I didn't do extensive tests, only a quick run.

preda commented 1 year ago

I'm taking a look now.

preda commented 1 year ago

The change you propose reduces performance (by about 2%) on ROCm 5.5.1, Radeon VII, on exponent 113203711, which uses FFT 1K:12:256. What's more, without the change ("as is") there is no problem observed on ROCm/Radeon VII.

Here is what I observed in the generated ISA: for a kernel declared with group-size=64 (which therefore does not need s_barrier on the Radeon VII; see the sketch below), without the change I see

ds_write_b128 v102, v[13:16]
ds_write_b128 v102, v[17:20] offset:16
; wave barrier
ds_read2st64_b64 v[29:32], v101 offset1:1
ds_read2st64_b64 v[33:36], v101 offset0:2 offset1:3

while with the change I see

ds_write_b128 v102, v[13:16]
ds_write_b128 v102, v[17:20] offset:16
s_waitcnt lgkmcnt(0)
; wave barrier
s_waitcnt lgkmcnt(0)
ds_read2st64_b64 v[29:32], v101 offset1:1
ds_read2st64_b64 v[33:36], v101 offset0:2 offset1:3

As you can see, redundant s_waitcnt lgkmcnt(0) instructions are generated. This looks like a compiler shortcoming.
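
For context, a minimal sketch of what "group-size=64" means at the source level (invented kernel name, not the gpuowl source): on the Radeon VII a wavefront is 64 wide, so a work-group of 64 runs in a single wavefront and the compiler can drop the s_barrier instruction, leaving only the memory-ordering part of the barrier, which is what the "; wave barrier" line in the listings above marks.

// Sketch only, not gpuowl code.
__attribute__((reqd_work_group_size(64, 1, 1)))   // whole group fits in one 64-wide wavefront
kernel void one_wave(global float *data, local float *lds) {
  uint i = get_local_id(0);
  lds[i] = data[get_global_id(0)];   // write local memory
  barrier(CLK_LOCAL_MEM_FENCE);      // no s_barrier emitted at group size 64
  data[get_global_id(0)] = lds[63 - i];
}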

preda commented 1 year ago

We found a problem when using gpuowl when testing a ROCm release.

Please provide a repro for the problem you found; I'd like to reproduce it myself. If I can confirm it, I'll look for a fix, including along the lines of your proposed change.

jmmartinez commented 1 year ago

Hello!

I run gpuowl as follows: ./gpuowl -prp 84682337 (to be honest, I don't know why these parameters are used; I'm not familiar with gpuowl).

I'm able to reproduce the issue on a gfx1030.

The issue appears after "[AMDGPU] Omit unnecessary waitcnt before barriers" landed in LLVM, which removes the s_waitcnt inserted before s_barrier for the gfx90a, gfx1010, gfx1030 and gfx940 targets.

I'm currently trying to figure out how to get a ROCm beta release containing that commit that I can share.

In the meantime, I'll show the asm I've got.

Before "Omit unnecessary waitcnt before barriers"

Before the LLVM patch, I get the following assembly:

ds_write2_b64 v113, v[2:3], v[4:5] offset1:16
ds_write2_b64 v113, v[20:21], v[22:23] offset0:32 offset1:48
s_waitcnt lgkmcnt(0)
s_barrier
ds_read2st64_b64 v[0:3], v110 offset1:1
ds_read2st64_b64 v[4:7], v110 offset0:2 offset1:3

After "Omit unnecessary waitcnt before barriers"

The s_waitcnt is gone (as expected).

ds_write2_b64 v113, v[2:3], v[4:5] offset1:16
ds_write2_b64 v113, v[20:21], v[22:23] offset0:32 offset1:48
s_barrier
ds_read2st64_b64 v[0:3], v110 offset1:1
ds_read2st64_b64 v[4:7], v110 offset0:2 offset1:3

But when executing gpuowl, I get the following output:

20230628 15:35:34 9ff6ee0b0512eb0c 84682337 OpenCL compilation in 2.42 s
20230628 15:35:34 9ff6ee0b0512eb0c 84682337 PRP starting from beginning
20230628 15:35:34 9ff6ee0b0512eb0c 84682337 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20230628 15:35:34 9ff6ee0b0512eb0c 84682337 PRP starting from beginning
20230628 15:35:35 9ff6ee0b0512eb0c 84682337 EE         0 on-load: 0000000000000000 vs. 0000000000000003
20230628 15:35:35 9ff6ee0b0512eb0c  Exiting because "error on load"
20230628 15:35:35 9ff6ee0b0512eb0c  Bye

This PR's assembly after "Omit unnecessary waitcnt before barriers"

With this PR, I get the following assembly:

ds_write2_b64 v113, v[2:3], v[4:5] offset1:16
ds_write2_b64 v113, v[20:21], v[22:23] offset0:32 offset1:48
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
s_barrier
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
buffer_gl0_inv
ds_read2st64_b64 v[0:3], v110 offset1:1
ds_read2st64_b64 v[4:7], v110 offset0:2 offset1:3

I agree that the s_waitcnt sandwich looks redundant. I'm currently trying to figure out if we can safely remove some of those waits in the compiler.

However, I still think that this patch is relevant. After the different threads queue their writes to the local data share and synchronize at the barrier, there is currently nothing ensuring that the changes written by one thread are visible to the other threads in the work-group.
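
To make the hazard concrete, here is a hypothetical standalone reproducer (not part of gpuowl; kernel and buffer names are made up): each work-item publishes its id in local memory and then reads its neighbour's slot across the barrier. With barrier(CLK_LOCAL_MEM_FENCE) the check must pass; with barrier(0) nothing guarantees the neighbour's write is visible.

#define GROUP 256

// Hypothetical check kernel, not gpuowl code; launch with a work-group size of GROUP.
kernel void neighbour_check(global uint *errors) {
  local uint lds[GROUP];
  uint i = get_local_id(0);

  lds[i] = i;                        // each work-item publishes its id

  barrier(CLK_LOCAL_MEM_FENCE);      // rendezvous plus local-memory fence

  uint expected = (i + 1) % GROUP;
  if (lds[expected] != expected) atomic_inc(errors);   // counts missed writes
}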

I'll try to get back soon with the ROCm release candidate containing the patch that triggers the issue.

Thanks again for your time!

preda commented 1 year ago

Thanks! I'll apply the fix.

jmmartinez commented 1 year ago

@preda ping :) Was there any update on this issue?

I've created a patch in LLVM to fix the waitcnt issue you mentioned. I hope it will land soon.

preda commented 12 months ago

Thank you!