Open zjin-lcf opened 1 year ago
Hello,
There is not a direct API match for this functionality, but you can construct such functionality using APIs in oneDPL with zip_iterator
, discard_iterator
and std::copy_if
.
It would look something like this:
template<typename Policy, typename InputIt, typename FlaggedIt, typename OutputIt>
void flagged(Policy&& policy, InputIt it, FlaggedIt flagged, OutputIt out, ::std::uint64_t size, ::std::uint64_t * num_copied)
{
auto zip_b = oneapi::dpl::make_zip_iterator(it, flagged);
auto zip_e = oneapi::dpl::make_zip_iterator(it + size, flagged + size);
auto out_it = oneapi::dpl::make_zip_iterator(out, oneapi::dpl::discard_iterator{});
auto end_it = ::std::copy_if(::std::forward<Policy>(policy), zip_b, zip_e, out_it, [](auto const & x) {
return ::std::get<1>(x);
});
*num_copied = ::std::distance(out_it, end_it);
}
Let us know if this works for you and if you have any additional questions!
Thanks for the definition of the function. Will a similar function be added to oneDPL ?
There is a PR #276 which would add transform_if
to oneDPL which would be a closer match. I believe it could be used to achieve an equivalent for flagged
more easily than the solution provided above using a no-op unary function for the transform. The PR has been dormant for some time, but still has some interest in it and may still be included into oneDPL. Would that be helpful for your purposes?
It would be helpful for developers to migrate the CUB function with a single function in oneDPL. The CUB function is called in Pytorch and Tensorflow:
pytorch/caffe2/operators/generate_proposals_op.cu pytorch/aten/src/ATen/native/cuda/Nonzero.cu pytorch/caffe2/operators/boolean_mask_ops.cu
tensorflow/core/kernels/batched_non_max_suppression_op.cu.cc tensorflow/core/kernels/non_max_suppression_op.cu.cc tensorflow/core/kernels/generate_box_proposals_op.cu.cc tensorflow/core/kernels/where_op_gpu.cu.h
The example using your solution might be useful for you to evaluate performance portability.
https://github.com/zjin-lcf/HeCBench/tree/master/nonzero-cuda https://github.com/zjin-lcf/HeCBench/tree/master/nonzero-sycl
I was mistaken in my reference to PR #276. This does PR does not in fact map well to the requested algorithm. It does not compact the output to only those for which the predicate is true as flagged
does. Instead of omitting the indices where the predicate is false, it instead propagates the initial value of the output iterator at that index.
However, there is a helper function in SYCLomatic helper headers which provides dpct::copy_if
which implements most of the mentioned function above, and allows for a simpler one line migration.
I came across a CUB function shown below. I am not sure if oneDPL provides a functionally equivalent API. Thank you.