A related issue: if the inner size is <= warpSize, a warp-wide barrier should be added. Currently no @barrier is added at all. That's tricky at least for Nvidia's Volta and later architectures, where independent thread scheduling means you can no longer assume that the threads in a warp run in lock-step.
This is also relevant for OpenCL and SYCL/DPC++, since the innermost @inner loop will be mapped to a sub-group. The newer versions of both standards support sub-group barriers (`sub_group_barrier` in OpenCL, `sycl::group_barrier` on a `sycl::sub_group` in SYCL 2020).
Originally posted by @stgeke in https://github.com/libocca/occa/issues/484#issuecomment-919249600
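
To make the concern concrete, here is a minimal hand-written CUDA sketch of the pattern (this is not OCCA-generated code; the kernel name `reverseWithinWarp` and the constant `kWarpSize` are made up for illustration). One warp exchanges values through shared memory, which is roughly what two consecutive @inner loops of size <= warpSize would do. Without the `__syncwarp()` between the write and the read, the result is not guaranteed on Volta and later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical constant for this sketch; matches the warp size on current Nvidia GPUs.
constexpr int kWarpSize = 32;

// Each lane writes its value to shared memory, then reads the value
// written by the mirrored lane of the same warp.
__global__ void reverseWithinWarp(const int* in, int* out) {
  __shared__ int tile[kWarpSize];
  const int lane = threadIdx.x;

  tile[lane] = in[lane];

  // Warp-wide barrier: synchronizes the lanes of this warp and orders
  // their shared-memory accesses. Cheaper than __syncthreads(), and
  // sufficient here because the block is a single warp.
  __syncwarp();

  out[lane] = tile[kWarpSize - 1 - lane];
}

int main() {
  int hostIn[kWarpSize], hostOut[kWarpSize];
  for (int i = 0; i < kWarpSize; ++i) hostIn[i] = i;

  int* devIn = nullptr;
  int* devOut = nullptr;
  cudaMalloc(&devIn, sizeof(hostIn));
  cudaMalloc(&devOut, sizeof(hostOut));
  cudaMemcpy(devIn, hostIn, sizeof(hostIn), cudaMemcpyHostToDevice);

  // One block of exactly one warp, i.e. inner size == warpSize.
  reverseWithinWarp<<<1, kWarpSize>>>(devIn, devOut);
  cudaMemcpy(hostOut, devOut, sizeof(hostOut), cudaMemcpyDeviceToHost);

  // Verify that every lane saw the value written by its mirrored lane.
  bool ok = true;
  for (int i = 0; i < kWarpSize; ++i) {
    if (hostOut[i] != kWarpSize - 1 - i) ok = false;
  }
  std::printf("%s\n", ok ? "reversed correctly" : "mismatch");

  cudaFree(devIn);
  cudaFree(devOut);
  return ok ? 0 : 1;
}
```

`__syncwarp()` is available since CUDA 9. The OpenCL and SYCL equivalents would be the sub-group barriers mentioned above.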