The semantics of CUB and oneDPL algorithm functions differ in several respects.
oneDPL follows the C++ standard when defining its functionality, particularly the parallel algorithms. Its functions take only parameters for input and output data; any temporary storage needs to be allocated internally by the implementation. In that regard, oneDPL is comparable to Thrust rather than CUB.
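A minimal sketch of what that looks like in practice, assuming a SYCL queue, USM shared memory, and the default device policy (names such as `q` and `data` are illustrative). Note there is no temporary-storage parameter; any scratch space is managed inside the library:

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/numeric>
#include <sycl/sycl.hpp>
#include <algorithm>
#include <functional>
#include <iostream>

int main() {
    sycl::queue q;
    constexpr int n = 1 << 20;
    int* data = sycl::malloc_shared<int>(n, q);
    std::fill(data, data + n, 1);

    // Only the input range, initial value, and operation are passed;
    // the implementation allocates any temporary storage it needs internally
    // (comparable to Thrust, not to CUB's explicit temp-storage parameter).
    auto policy = oneapi::dpl::execution::make_device_policy(q);
    int sum = oneapi::dpl::reduce(policy, data, data + n, 0, std::plus<int>());

    std::cout << "sum = " << sum << "\n";
    sycl::free(data, q);
    return 0;
}
```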
For more low-level, performance-tuning-oriented semantics like those of CUB, we recently started adding experimental "kernel template" APIs. These are still at an early stage (not yet sufficient to cover your request) and will continue to evolve based on feedback.
I see the difference. Thanks.
Profiling shows that the "reduce" function allocates memory each time it is called. Could the memory allocation occur only once?
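For reference, CUB's device-level reduction exposes the temporary storage explicitly: the first call only queries the required size, so the buffer can be allocated once and reused across calls. A minimal sketch of that two-phase pattern (the helper name `repeated_sums` is illustrative; error handling omitted):

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Query the scratch size once, allocate once, then reuse the same
// d_temp_storage for repeated reductions.
void repeated_sums(const int* d_in, int* d_out, int num_items, int iterations) {
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with a null temp-storage pointer only computes the required size.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // The allocation happens once; every later call reuses the same buffer.
    for (int i = 0; i < iterations; ++i)
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}
```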
This is the CUDA CUB sample: