Open bader opened 8 months ago
Could you please explain why nd_item::barrier needs to be replaced by sycl::group_barrier ? I am not a customer mentioned in the comments, but I think many users' SYCL programs still call item.barrier().
There is no nd_item::barrier
in SYCL 2020 specification.
Removing the existing nd_item::barrier
will have a non-trivial performance impact on applications that use this with the access::fence_space::local_space
flag. In this case, __spirv_ControlBarrier
is called using the WorkgroupMemory
mask returned from detail::getSPIRVMemorySemanticsMask
.
In the SYCL 2020 spec group_barrier
doesn't accept an equivalent to the local_space
flag. The current implementation uses SPIRV CrossWorkgroupMemory
memory semantics for all calls to group_barrier
.
@Pennycook, is there any public document addressing performance concerns of moving from nd_item::barrier
to group_barrier
?
@Pennycook, is there any public document addressing performance concerns of moving from
nd_item::barrier
togroup_barrier
?
Not that I'm aware of. Any performance impact of the change is both implementation- and hardware-specific.
@bader I may confuse item.barrier() with nd_item::barrier()
@bader Alexey, which version macro (see below) will be changed and what is a new value for the changed macro after the PR will be merged? LIBSYCL_MAJOR_VERSION LIBSYCL_MINOR_VERSION __LIBSYCL_PATCH_VERSION
@MikeDvorskiy, as this is API breaking change, all of them will change. I can't say what will be new values at this moment as there is PR with this change yet (as far as I know). Tagging folks who can follow-up on that issue: @sergey-semenov, @AlexeySachkov, @steffenlarsen.
@Pennycook, is there any public document addressing performance concerns of moving from nd_item::barrier to group_barrier?
A quick test on GROMACS shows that replacing a bunch of itemIdx.barrier(fence_space::local_space)
with sycl::group_barrier(itemIdx.get_group())
slows down one of the kernels (PmeGatherKernel<4, true, true, 1, false, (ThreadsPerAtom)0, 16>
) nearly 1.8x (from 270µs to 470µs for 768k Grappa system) on PVC 1100.
This kernel is not super-optimized (it is responsible for ~4% of the runtime, at least when using nd_item::barrier
) and perhaps this slowdown can be worked around. But still, the performance concerns are real.
Some earlier microbenchmarks: https://github.com/intel/llvm/issues/7069#issuecomment-1281168590
@bader, @al42and: The performance issue was recently root-caused to a bug in IGC. With the fix from https://github.com/intel/intel-graphics-compiler/commit/ed639f68d142bc963a7b626badc207a42fb281cb, you should still expect some small overhead from the difference in fence semantics, but performance should be significantly better than it was before.
SYCL 2020 removed this function completely, in favor of
sycl::group_barrier
._Originally posted by @Pennycook in https://github.com/intel/llvm/pull/12236#discussion_r1470281101_