oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0

memory allocation of oneapi::dpl::reduce() #1891

Closed: jinz2014 closed this issue 2 weeks ago

jinz2014 commented 3 weeks ago

Profiling shows that the "reduce" function allocates memory each time it is called. Could the memory allocation occur only once and be reused across calls?
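
For context, this is roughly the shape of the loop being profiled (a minimal sketch; the queue, data, sizes, and iteration count below are placeholders, not the actual benchmark code):

  #include <oneapi/dpl/execution>
  #include <oneapi/dpl/numeric>
  #include <sycl/sycl.hpp>

  int main() {
    const int n = 1 << 20, iter = 100;       // placeholder sizes
    sycl::queue q;
    int *d_data = sycl::malloc_device<int>(n, q);
    q.fill(d_data, 1, n).wait();             // placeholder input data

    auto policy = oneapi::dpl::execution::make_device_policy(q);
    for (int i = 0; i < iter; i++) {
      // Each call may allocate (and release) its own temporary storage
      // internally, which is what shows up in the profile.
      int sum = oneapi::dpl::reduce(policy, d_data, d_data + n, 0);
      (void)sum;
    }
    sycl::free(d_data, q);
    return 0;
  }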

This is the CUDA CUB sample:


  // Determine temporary device storage requirements
  void     *d_temp_storage = nullptr;
  size_t   temp_storage_bytes = 0;
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, ...);

  // Allocate temporary storage
  if (temp_storage_bytes != 0)
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Reuse the same temporary storage on every iteration
  for (int i = 0; i < iter; i++) {
    ...
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, ...);
    ...
  }
akukanov commented 2 weeks ago

The semantics of CUB and oneDPL algorithm functions differ in multiple aspects.

oneDPL follows the C++ standard when defining its functionality, particularly the parallel algorithms. There are only parameters for input and output data; any temporary storage needs to be allocated internally by the implementation. In that regard, oneDPL is comparable to Thrust rather than CUB.
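
To illustrate that comparison (a sketch, not from the original comment; the size n is a placeholder), the Thrust interface has the same single-call shape, with no temporary-storage parameters:

  #include <thrust/device_vector.h>
  #include <thrust/reduce.h>
  #include <thrust/execution_policy.h>

  int main() {
    const int n = 1 << 20;                    // placeholder size
    thrust::device_vector<int> d_in(n, 1);    // input data on the device

    // One call: only the policy, the data range, and the initial value.
    // Any scratch memory Thrust needs is allocated inside the call,
    // just as oneapi::dpl::reduce allocates its scratch memory internally.
    int sum = thrust::reduce(thrust::device, d_in.begin(), d_in.end(), 0);
    (void)sum;
    return 0;
  }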

For more low-level, performance-tuning-oriented semantics like those of CUB, we recently started adding experimental "kernel template" APIs. These are still at an early stage (not yet sufficient to cover your request) and will continue to evolve based on feedback.

jinz2014 commented 2 weeks ago

I see the difference. Thanks.