oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0
720 stars 114 forks source link

merge-sort: optimize leaf stage #1735

Closed dmitriy-sobolev closed 4 weeks ago

dmitriy-sobolev commented 1 month ago

Process more items at the leaf-sort stage. It substantially improves the performance on GPU devices.

The proposed implementation is the most genetic since it sorts the data stably, and relies only on the move-assignability and move-copyability of the type (according to the sort algorithm requirements). It can be specialized in the future.

dmitriy-sobolev commented 1 month ago

I've found that the changes result in incorrect sort on Nvidia GPUs. I am debugging it.

Upd. It's been fixed in https://github.com/oneapi-src/oneDPL/pull/1735/commits/d92c9de5a6347571d8993ceffb11cfd82838dc9e.

SergeyKopienko commented 1 month ago

@dmitriy-sobolev , in this PR you have two calls of depends_on : mandatory required to capture their params by value (now it captured by reference so it's will work incorrectly). UPD: already fixed.

dmitriy-sobolev commented 1 month ago

@dmitriy-sobolev , in this PR you have two calls of depends_on : mandatory required to capture their params by value (now it captured by reference so it's will work incorrectly).

Thanks. I've fixed it.

SergeyKopienko commented 1 month ago

@dmitriy-sobolev please take a look to the code like this:

                            const auto& __rng1 = oneapi::dpl::__ranges::drop_view_simple(__dst, __offset);
                            const auto& __rng2 = oneapi::dpl::__ranges::drop_view_simple(__dst, __offset + __n1);

1) & isn't required here; 2) Really we create the new instance of drop_view_simple in this code, so I propose to rewrite it:

                            const oneapi::dpl::__ranges::drop_view_simple __rng1(__dst, __offset);
                            const oneapi::dpl::__ranges::drop_view_simple __rng2(__dst, __offset + __n1);

There are four places...

dmitriy-sobolev commented 1 month ago

@dmitriy-sobolev please take a look to the code like this:

                            const auto& __rng1 = oneapi::dpl::__ranges::drop_view_simple(__dst, __offset);
                            const auto& __rng2 = oneapi::dpl::__ranges::drop_view_simple(__dst, __offset + __n1);
  1. & isn't required here;
  2. Really we create the new instance of drop_view_simple in this code, so I propose to rewrite it:
                            const oneapi::dpl::__ranges::drop_view_simple __rng1(__dst, __offset);
                            const oneapi::dpl::__ranges::drop_view_simple __rng2(__dst, __offset + __n1);

There are four places...

Done

akukanov commented 4 weeks ago

Is it critical for performance that data-per-workitem is a template parameter of __leaf_sorter and __group_merge_path_sorter, instead of a runtime parameter? I am concerned about the proliferation of kernel instantiations, and the only reason why leaf sorters have different C++ type is that parameter.

dmitriy-sobolev commented 4 weeks ago

Is it critical for performance that data-per-workitem is a template parameter of __leaf_sorter and __group_merge_path_sorter, instead of a runtime parameter? I am concerned about the proliferation of kernel instantiations, and the only reason why leaf sorters have different C++ type is that parameter.

No it is not. As far as I remember the performance impact is marginal. Probably, it is better to get rid of it for now.