intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.23k stars 734 forks source link

Deprecate and remove `nd_item::barrier`. #12531

Open bader opened 8 months ago

bader commented 8 months ago

SYCL 2020 removed this function completely, in favor of sycl::group_barrier.

_Originally posted by @Pennycook in https://github.com/intel/llvm/pull/12236#discussion_r1470281101_

jinz2014 commented 8 months ago

Could you please explain why nd_item::barrier needs to be replaced by sycl::group_barrier ? I am not a customer mentioned in the comments, but I think many users' SYCL programs still call item.barrier().

bader commented 8 months ago

There is no nd_item::barrier in SYCL 2020 specification.

kris-rowe commented 8 months ago

Removing the existing nd_item::barrier will have a non-trivial performance impact on applications that use this with the access::fence_space::local_space flag. In this case, __spirv_ControlBarrier is called using the WorkgroupMemory mask returned from detail::getSPIRVMemorySemanticsMask.

In the SYCL 2020 spec group_barrier doesn't accept an equivalent to the local_space flag. The current implementation uses SPIRV CrossWorkgroupMemory memory semantics for all calls to group_barrier.

bader commented 8 months ago

@Pennycook, is there any public document addressing performance concerns of moving from nd_item::barrier to group_barrier?

Pennycook commented 8 months ago

@Pennycook, is there any public document addressing performance concerns of moving from nd_item::barrier to group_barrier?

Not that I'm aware of. Any performance impact of the change is both implementation- and hardware-specific.

jinz2014 commented 8 months ago

@bader I may confuse item.barrier() with nd_item::barrier()

MikeDvorskiy commented 4 months ago

@bader Alexey, which version macro (see below) will be changed and what is a new value for the changed macro after the PR will be merged? LIBSYCL_MAJOR_VERSION LIBSYCL_MINOR_VERSION __LIBSYCL_PATCH_VERSION

bader commented 4 months ago

@MikeDvorskiy, as this is API breaking change, all of them will change. I can't say what will be new values at this moment as there is PR with this change yet (as far as I know). Tagging folks who can follow-up on that issue: @sergey-semenov, @AlexeySachkov, @steffenlarsen.

al42and commented 4 months ago

@Pennycook, is there any public document addressing performance concerns of moving from nd_item::barrier to group_barrier?

A quick test on GROMACS shows that replacing a bunch of itemIdx.barrier(fence_space::local_space) with sycl::group_barrier(itemIdx.get_group()) slows down one of the kernels (PmeGatherKernel<4, true, true, 1, false, (ThreadsPerAtom)0, 16>) nearly 1.8x (from 270µs to 470µs for 768k Grappa system) on PVC 1100.

This kernel is not super-optimized (it is responsible for ~4% of the runtime, at least when using nd_item::barrier) and perhaps this slowdown can be worked around. But still, the performance concerns are real.

Some earlier microbenchmarks: https://github.com/intel/llvm/issues/7069#issuecomment-1281168590

Pennycook commented 3 weeks ago

@bader, @al42and: The performance issue was recently root-caused to a bug in IGC. With the fix from https://github.com/intel/intel-graphics-compiler/commit/ed639f68d142bc963a7b626badc207a42fb281cb, you should still expect some small overhead from the difference in fence semantics, but performance should be significantly better than it was before.