Open MikeDvorskiy opened 1 year ago
To the same goal of future performance improvement of merge sort: It seems worthwhile to investigate leaf size changes based on total input size to avoid the extra copy from temporary data to output data, originally from a comment from @dmitriy-sobolev https://github.com/oneapi-src/oneDPL/pull/1098#discussion_r1305549820
Additionally, it would be good to re-investigate selection of leaf size and its effect on performance. I believe when this was looked at previously, the number of iterations of the merge phase was not being adjusted accordingly. I expect this should provide some benefit, especially in the context of avoiding the extra copy from temporary data.
Leaf size was enlarged as a part of https://github.com/oneapi-src/oneDPL/pull/1735. The performance benefit was significant for GPUs. However, the question regarding adjustment of a leaf size to avoid extra copy still remains open.
In the long term, a one group version of this could be a big improvement for small sizes of n, shrinking down to a single kernel launches from O(log(n)) kernel launches where we would be in a single or only a few workgroups anyway.
Originally posted by @danhoeflinger in https://github.com/oneapi-src/oneDPL/pull/1098#pullrequestreview-1594106537