oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0

Performance improvements of `__parallel_find_or` + `__device_backend_tag` #1617

Closed SergeyKopienko closed 3 weeks ago

SergeyKopienko commented 1 month ago

In this PR we made some performance improvements:


The changes for

SergeyKopienko commented 1 month ago

I am going to remove all comments like `// Point #` after receiving some reviews. They may help to compare the differences between the specializations for `__parallel_or_tag` and for `__parallel_find_forward_tag` / `__parallel_find_backward_tag`.

SergeyKopienko commented 1 month ago

@MikeDvorskiy, what do you think: does it make sense to apply these changes to the range implementations too?

Actually, we should not duplicate the changes. Both the iterator-based `__pattern_any_of` and the range-based `__pattern_any_of` call the same corresponding range-based hetero backend pattern: `oneapi::dpl::__par_backend_hetero::__parallel_find_or`.

So, from a code design perspective, you should modify `oneapi::dpl::__par_backend_hetero::__parallel_find_or`, or introduce another new hetero backend find pattern (e.g. reduction-based).
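The layering described above (both front-end patterns forwarding to a single backend entry point, so an optimization in the backend benefits both) could be sketched roughly as follows. This is a simplified, serial stand-in: `parallel_find_or`, `pattern_any_of`, and `View` are illustrative names, not the actual oneDPL internals.

```cpp
#include <cassert>
#include <vector>

namespace backend
{
// Serial stand-in for oneapi::dpl::__par_backend_hetero::__parallel_find_or.
template <typename Range, typename Pred>
bool parallel_find_or(const Range& rng, Pred pred)
{
    for (auto&& e : rng)
        if (pred(e))
            return true;
    return false;
}
} // namespace backend

// Iterator-based pattern: wraps the iterators in a minimal view and
// calls the same backend as the range-based pattern.
template <typename It, typename Pred>
bool pattern_any_of(It first, It last, Pred pred)
{
    struct View
    {
        It f, l;
        It begin() const { return f; }
        It end() const { return l; }
    };
    return backend::parallel_find_or(View{first, last}, pred);
}

// Range-based pattern: calls the backend directly.
template <typename Range, typename Pred>
bool pattern_any_of(const Range& rng, Pred pred)
{
    return backend::parallel_find_or(rng, pred);
}
```

With this shape, a performance change made once in `backend::parallel_find_or` is picked up by both front ends, which is the point being made above.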

danhoeflinger commented 1 month ago

As written, `__parallel_find_or` seems to be doing a `||` reduction using a global atomic variable to combine the results from different work-items. I would expect that for this to beat the transform-reduce pattern, we would need a real "early exit" check at the beginning of a work-item's work (or every m iterations), which could exit if the global atomic was already true.
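The per-work-item check being suggested could look roughly like this. It is a hypothetical sketch (host-side C++, not the oneDPL kernel; `find_or_work_item` and `check_period` are illustrative names): one work-item scans its chunk and re-reads the global atomic flag every `check_period` iterations, exiting early once any other work-item has already found a match.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of one work-item's loop with a periodic
// early-exit check against a global atomic "found" flag.
inline void find_or_work_item(const std::vector<int>& data,
                              std::size_t begin, std::size_t end,
                              int value, std::atomic<bool>& found,
                              std::size_t check_period = 64)
{
    for (std::size_t i = begin; i < end; ++i)
    {
        // Periodic early-exit check; a relaxed load suffices because a
        // missed signal only costs extra work, never correctness.
        if ((i - begin) % check_period == 0 &&
            found.load(std::memory_order_relaxed))
            return;
        if (data[i] == value)
        {
            found.store(true, std::memory_order_relaxed);
            return;
        }
    }
}
```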

MikeDvorskiy commented 1 month ago

> a workitem's work (or every m iterations) which could exit if the global atomic was true.

Theoretically, it might be true. But, practically, access to a global atomic is very "expensive".

danhoeflinger commented 1 month ago

> Theoretically, it might be true. But, practically, access to a global atomic is very "expensive".

I agree. Without running benchmarks, I'm not sure what the best option is. I'll elaborate to explain the point of my comment.

TL;DR: I think we should call the transform-reduce pattern here, and possibly add a specialization of reduce for `std::logical_or` / `std::logical_and` if that helps from a performance perspective, which might include local and/or global "early exits". Also, the implementation as written assumes the data is "sparse" (mostly the identity for the operation), and we should not assume this. My guess is that if a global atomic is used in this specialization for an early exit, we will need to take more steps to reduce contention.


More details:

Right now, this implementation looks like a `std::logical_or`-specialized, atomic-based reduction kernel for commutative operations, with a couple of augmentations: 1) it checks for the identity (false) and opts out of touching the global atomic when a work-item's data reduces to the identity; 2) it short-circuits (early-exits) an individual work-item's work when it finds a true.
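The two augmentations can be sketched in isolation like this (a serial, hypothetical stand-in; `or_reduce_chunk` is an illustrative name, and the function returns the number of global-atomic updates purely so the "opt out on identity" behavior is observable):

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// One work-item's share of a logical-or reduction with:
//  (1) identity opt-out: a chunk that reduces to false never touches
//      the global atomic;
//  (2) short circuit: scanning stops at the first true in the chunk.
inline int or_reduce_chunk(const std::vector<bool>& chunk,
                           std::atomic<bool>& global_or)
{
    bool local = false;
    for (bool b : chunk)
        if (b)
        {
            local = true; // (2) short circuit this work-item's loop
            break;
        }
    if (!local)
        return 0; // (1) identity: opt out of touching the global atomic
    global_or.store(true, std::memory_order_relaxed);
    return 1;
}
```

With sparse data most chunks return 0 and the atomic is rarely touched; with dense data nearly every chunk performs the store, which is where the contention concern below comes from.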

For (1): Let's define an input data sequence to be "sparse" if each element has a very high likelihood of being the identity of the binary operation in a reduction. This implementation should perform well with sparse data, as it limits contention on the global atomic variable by opting out of touching it when an individual work-item's partial reduction results in the identity (all data in the work-item is false). However, I don't know of a reason we should assume input sequences to this pattern are sparse. I would expect our normal (commutative) reduction algorithm to perform better for non-sparse data due to contention on the atomic. In any case, I would still use the commutative reduce pattern here and, if necessary, improve the performance of the reduce pattern itself.

For (2): Similarly, it could be worthwhile to have a specialization of our reduction algorithm for `std::logical_and` / `std::logical_or` which takes advantage of an early exit (which is relevant for those specific reduce calls as well). The early exit at the work-item level (as done here) should be possible with or without atomics, but we should check that it helps. We could also consider an "early exit" at the sub-group or work-group level, using `sycl::any_of_group` (or local atomics); this is orthogonal to the question of using global atomics, though.
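A group-level early exit could be sketched like this, with `std::any_of` over a fixed-size block standing in for `sycl::any_of_group` over a work-group (a serial, hypothetical model; `or_reduce_blockwise` is an illustrative name, and it returns the number of blocks actually processed so the early exit is observable):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Block-wise logical-or with a group-wide early exit: as soon as any
// block contains a true, no later block is touched.
inline std::size_t or_reduce_blockwise(const std::vector<bool>& data,
                                       std::size_t block_size)
{
    std::size_t blocks = 0;
    for (std::size_t i = 0; i < data.size(); i += block_size)
    {
        const std::size_t end = std::min(i + block_size, data.size());
        ++blocks;
        if (std::any_of(data.begin() + static_cast<std::ptrdiff_t>(i),
                        data.begin() + static_cast<std::ptrdiff_t>(end),
                        [](bool b) { return b; }))
            break; // group-wide early exit
    }
    return blocks;
}
```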

One possible justification for the global atomic, beyond an assumption of sparse data, could be to use it as a global "early exit". However, that would need to be proven generally beneficial on a variety of input data. There are different approaches we could try here as well; for instance, you could involve the global atomic only to implement the early exit and not involve it in the return value. A single work-item could update this global atomic at the end of a work-group's work, and work-items could check (read-only) this global atomic on startup, or periodically. I think we can reduce the contention impact of the "atomic reads" by using a relaxed memory order, since we don't really care when the read happens, only that it is valid at the point in time when it does. It is merely a signal to exit early, so if it is missed, it is not a correctness issue.
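The scheme described here could be sketched as follows, with plain `std::thread` as a hypothetical stand-in for work-groups (all names illustrative): the global atomic serves only as an early-exit signal and is not involved in the return value; each thread records its result in its own slot, writes the signal once on a find, and polls it read-only with a relaxed load every `period` elements.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Any-of search where the atomic is only an early-exit signal; the
// result is combined from per-thread slots afterwards.
inline bool contains(const std::vector<int>& data, int value,
                     std::size_t n_threads = 4, std::size_t period = 64)
{
    std::atomic<bool> stop{false};         // early-exit signal only
    std::vector<char> found(n_threads, 0); // per-thread results
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;

    for (std::size_t t = 0; t < n_threads; ++t)
        threads.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
            {
                // Relaxed read-only poll: a missed signal only costs
                // extra work, never correctness.
                if ((i - begin) % period == 0 &&
                    stop.load(std::memory_order_relaxed))
                    return;
                if (data[i] == value)
                {
                    found[t] = 1;
                    stop.store(true, std::memory_order_relaxed); // one write
                    return;
                }
            }
        });
    for (auto& th : threads)
        th.join();

    // Combine the per-thread results without touching the atomic.
    return std::any_of(found.begin(), found.end(),
                       [](char f) { return f != 0; });
}
```

Skipping work on a seen signal is safe for an any-of reduction because the signal is only ever set when a match has already been recorded somewhere.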