Semantics of gpu_thread_barrier()

shoaibkamil commented 4 years ago

The intrinsic gpu_thread_barrier() currently has different memory fence semantics on different platforms. We should decide and document what the semantics should be. In addition, if we want it to fence global/device memory as well as shared memory, then we need to change the backends that currently fence only shared memory so that they respect those semantics.

Currently, I believe the semantics are:

CUDA: threadgroup barrier, fence for shared & global memory
OpenCL: threadgroup barrier, fence for shared memory only
Metal: threadgroup barrier, fence for shared memory only
D3D: threadgroup barrier, fence for shared memory only
OpenGLCompute: threadgroup barrier, fence for shared & global memory (I think)

abadams commented 4 years ago

Ugh, so if we have a Func that is compute_at GPU blocks, and we elect to store it in MemoryType::Heap instead of MemoryType::GPUShared because it doesn't fit in shared memory, then we may not currently be generating correct code?

shoaibkamil commented 4 years ago

Yes, I believe that's correct. The generated code would be correct for CUDA and OpenGLCompute, whose barriers also fence global memory, but not for anything else :-/
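
For reference, the per-backend list from the first comment and the MemoryType::Heap case raised above can be written down as a small lookup. This is only a sketch in plain C++, not Halide code: the names Backend, Storage, barrier_fences_global and barrier_orders_writes are hypothetical, and the table reflects what is believed in this thread (the OpenGLCompute entry was marked "I think"), not verified backend behavior.

```cpp
// Hypothetical names for illustration only; none of these are Halide APIs.
enum class Backend { CUDA, OpenCL, Metal, D3D, OpenGLCompute };

// Where the Func computed at GPU blocks is stored.
// Shared stands in for MemoryType::GPUShared, Heap for MemoryType::Heap.
enum class Storage { Shared, Heap };

// Whether gpu_thread_barrier() is believed to also fence global/device
// memory on this backend, per the list in the first comment.
bool barrier_fences_global(Backend b) {
    switch (b) {
    case Backend::CUDA:
    case Backend::OpenGLCompute:  // uncertain in the thread ("I think")
        return true;
    case Backend::OpenCL:
    case Backend::Metal:
    case Backend::D3D:
        return false;
    }
    return false;
}

// Every backend in the list fences shared memory at the barrier, so only
// storage in global memory depends on the additional global fence.
bool barrier_orders_writes(Backend b, Storage s) {
    return s == Storage::Shared || barrier_fences_global(b);
}
```

Under these assumptions, barrier_orders_writes(Backend::OpenCL, Storage::Heap) is false while barrier_orders_writes(Backend::CUDA, Storage::Heap) is true, which is the discrepancy described above.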

halide/Halide issue #4967