oneDPL function for CUB's DeviceSelect::Flagged

oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html

Apache License 2.0

722 stars 113 forks source link

oneDPL function for CUB's DeviceSelect::Flagged #814

Open zjin-lcf opened 1 year ago

zjin-lcf commented 1 year ago

I came across a CUB function shown below. I am not sure if oneDPL provides a functionally equivalent API. Thank you.

#include <cub/cub.cuh>       // or equivalently <cub/device/device_select.cuh>
// Declare, allocate, and initialize device-accessible pointers for input, flags, and output
int  num_items;              // e.g., 8
int  *d_in;                  // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
char *d_flags;               // e.g., [1, 0, 0, 1, 0, 1, 1, 0]
int  *d_out;                 // e.g., [ ,  ,  ,  ,  ,  ,  ,  ]
int  *d_num_selected_out;    // e.g., [ ]
...
// Determine temporary device storage requirements
void     *d_temp_storage = NULL;
size_t   temp_storage_bytes = 0;
cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run selection
cub::DeviceSelect::Flagged(d_temp_storage, temp_storage_bytes, d_in, d_flags, d_out, d_num_selected_out, num_items);
// d_out                 <-- [1, 4, 6, 7]
// d_num_selected_out    <-- [4]

danhoeflinger commented 1 year ago

Hello, There is not a direct API match for this functionality, but you can construct such functionality using APIs in oneDPL with zip_iterator, discard_iterator and std::copy_if.

It would look something like this:

template<typename Policy, typename InputIt, typename FlaggedIt, typename OutputIt>
void flagged(Policy&& policy, InputIt it, FlaggedIt flagged, OutputIt out,  ::std::uint64_t size, ::std::uint64_t * num_copied)
{
    auto zip_b = oneapi::dpl::make_zip_iterator(it, flagged);
    auto zip_e = oneapi::dpl::make_zip_iterator(it + size, flagged + size);
    auto out_it = oneapi::dpl::make_zip_iterator(out, oneapi::dpl::discard_iterator{});

    auto end_it = ::std::copy_if(::std::forward<Policy>(policy), zip_b, zip_e, out_it, [](auto const & x) {
        return ::std::get<1>(x);
    });

    *num_copied = ::std::distance(out_it, end_it);
}

Let us know if this works for you and if you have any additional questions!

zjin-lcf commented 1 year ago

Thanks for the definition of the function. Will a similar function be added to oneDPL ?

danhoeflinger commented 1 year ago

There is a PR #276 which would add transform_if to oneDPL which would be a closer match. I believe it could be used to achieve an equivalent for flagged more easily than the solution provided above using a no-op unary function for the transform. The PR has been dormant for some time, but still has some interest in it and may still be included into oneDPL. Would that be helpful for your purposes?

zjin-lcf commented 1 year ago

It would be helpful for developers to migrate the CUB function with a single function in oneDPL. The CUB function is called in Pytorch and Tensorflow:

pytorch/caffe2/operators/generate_proposals_op.cu pytorch/aten/src/ATen/native/cuda/Nonzero.cu pytorch/caffe2/operators/boolean_mask_ops.cu

tensorflow/core/kernels/batched_non_max_suppression_op.cu.cc tensorflow/core/kernels/non_max_suppression_op.cu.cc tensorflow/core/kernels/generate_box_proposals_op.cu.cc tensorflow/core/kernels/where_op_gpu.cu.h

zjin-lcf commented 1 year ago

The example using your solution might be useful for you to evaluate performance portability.

https://github.com/zjin-lcf/HeCBench/tree/master/nonzero-cuda https://github.com/zjin-lcf/HeCBench/tree/master/nonzero-sycl

danhoeflinger commented 1 year ago

I was mistaken in my reference to PR #276. This does PR does not in fact map well to the requested algorithm. It does not compact the output to only those for which the predicate is true as flagged does. Instead of omitting the indices where the predicate is false, it instead propagates the initial value of the output iterator at that index.

However, there is a helper function in SYCLomatic helper headers which provides dpct::copy_if which implements most of the mentioned function above, and allows for a simpler one line migration.