oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0
720 stars 114 forks source link

Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative #1679

Open SergeyKopienko opened 2 months ago

SergeyKopienko commented 2 months ago

https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L170

Please see details in https://github.com/intel/llvm/issues/12531 Please see details in https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_group_broadcast

dmitriy-sobolev commented 2 months ago

There are two major reasons of using the outdated barrier API:

  1. Performance. You can find the details here: https://github.com/intel/llvm/issues/12531.
  2. Non-compatible semantics, as mentioned in the code comments, which may require some work: https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L171-L174

I suppose that the most important reason is 1. However, oneDPL is claimed to be SYCL 2020 conformant, so SYCL 2020 group_barrier should be used, at least for the compilers other than oneAPI DPC++ compiler.

akukanov commented 2 months ago

Let's finally clarify the question of semantics.

The __nd_item.barrier(sycl::access::fence_space::local_space) that is currently in use has the following semantics in SYCL 1.2.1:

Executes a work-group barrier with memory ordering on the local address space ... . The current work-item will wait at the barrier until all work-items in the current work-group have reached the barrier. In addition the barrier performs a fence operation ensuring that all memory accesses in the specified address space issued before the barrier complete before those issued after the barrier.

In other words, it serves as both a barrier for work items and as a memory fence (with unclear ordering semantics, but at least acquire-release as it seems from the description) for operations within local (i.e. work group) memory scope.

The work-group barrier sycl::group_barrier(__nd_item.get_group(), sycl::memory_scope::work_group) has the following semantics in SYCL 2020:

Synchronizes all work-items in a group. The current work-item will wait at the barrier until all work-items in the group have reached the barrier. In addition, the barrier performs mem-fence operations ensuring that memory accesses issued before the barrier are not re-ordered with those issued after the barrier: all work-items in the group execute a release fence prior to synchronizing at the barrier, all work-items in the group execute an acquire fence afterwards, and there is an implicit synchronization of these fences as if provided by an explicit atomic operation on an atomic object. By default, the scope of these fences is set to the narrowest scope including all work-items in the group ... This scope may be optionally overridden with a wider scope, specified by the fence_scope argument.

In other words, it serves as both the barrier and the acquire-release memory fence in the specified memory scope, which is the group scope sycl::memory_scope::work_group (and which is also the default for work-groups, so it can as well be implicit).

The differences I observe are:

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

I think the comment telling about them being "not quite equivalent" is there either because earlier versions of SYCL 2020 did not provide enough clarity or because the barrier memory ordering semantics of 1.2.1 were confused with those of atomics, for which 1.2.1 only supported relaxed memory ordering. But I do not believe that the relaxed ordering would satisfy the described "complete before" requirement.

al42and commented 2 months ago

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

As far as I understand, another difference (and the reason for performance drop mentioned in the linked issues) is that the old version "Executes a work-group barrier with memory ordering on the local address space", while the new version affects all memory operations, in both local and global address spaces (but both only within work-group/local scope):

sycl::memory_scope::work_group The ordering constraint applies only to work-items in the same work-group as the calling work-item;

akukanov commented 2 months ago

Thanks @al42and - indeed, this is an important difference that I missed, and that impacts performance.

If that difference is important for oneDPL code, then we should make it visible - either in the oneDPL wrapper name or maybe with a template parameter - that this barrier orders operations only for data in local memory but does not order global data accesses.