oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0
715 stars 112 forks source link

[Perf] Break loop in __early_exit_find_or after first found element #1644

Closed SergeyKopienko closed 2 weeks ago

SergeyKopienko commented 2 weeks ago

In this PR we implement break of for-loop inside __early_exit_find_or after first found element. The approach from the PR https://github.com/oneapi-src/oneDPL/pull/1624 is applicable for the first/last element search. This approach gives us a good performance boost. Technically this break call implemented thought additional local variable with check inside of continue-condition in for-loop due performance results.

Also in this PR, we analyze the result of compare_exchange_strong() call and break loop if the value of atomic has been changed correctly to avoid extra atomic value load() calls.

SergeyKopienko commented 2 weeks ago

@danhoeflinger, @julianmi how do you think, are we ready to merge this PR ?

SergeyKopienko commented 2 weeks ago

LGTM. However, I'd like to understand why this is faster than break statements. There might be potential performance benefits in other code parts where we use break statements in case this is a general issue.

Just now I found one place with break pattern in Kernel code and going to investigate this place separatelly. It's inside struct multiple_match_pred

danhoeflinger commented 2 weeks ago

My guess about the break statements is that it is more difficult for thread divergence within a subgroup. It does seem like the compiler should be able to generate code for a break; which is equivalently performant as a loop exit flag though.
It would be interesting to boil down to a very simple example and examine the resulting generated code.