Open BenBrock opened 1 year ago
Talking about any containers based on USM - oneDPL supports just std::vector with USM allocator.
https://oneapi-src.github.io/oneDPL/parallel_api/pass_data_algorithms.html
Pass Data to Algorithms
You can use one of the following ways to pass data to an algorithm executed with a device policy:
- oneapi:dpl::begin and oneapi::dpl::end functions
- Unified shared memory (USM) pointers and std::vector with USM allocators
- Iterators of host-side std::vector
Is there any way to have oneDPL directly launch the kernel with my iterators, instead of copying the data CPU-side?
Currently, just USM pointers, std::vector<..., USM_allocator>::begin or begin/end over a sycl::buffer.
Thanks to @MikeDvorskiy for pointing out oneapi::dpl::__internal::is_passed_directly
, which identifies whether an iterator can be directly passed into a SYCL kernel when a oneDPL algorithm is executed.
My proposal to address this issue, along with #854, is to introduce a direct_iterator
that wraps device iterators before they are passed into oneDPL. The helper function make_direct_iterator()
returns one of these direct_iterator
s, and achieves the desired behavior when used with oneDPL algorithms like the example above.
// Allocate USM device buffer
int* x_d = sycl::malloc_device<int>(100, q);
// Fill buffer pointed to by `x_d` with data.
. . .
// Create span from buffer
std::span<int> x(x_d, 100);
// Pass iterators from `std::span` into oneDPL reduce.
auto sum = oneapi::dpl::reduce(oneapi::dpl::execution::make_device_policy(q),
oneapi::dpl::make_direct_iterator(x.begin()),
oneapi::dpl::make_direct_iterator(x.end()),
0, std::plus());
I have implemented something similar internally in our distributed ranges codebase, but I've also written a quick draft of this in a PR.
I'd like to pass device iterators---by which I mean random access iterators that work in device kernels---into oneDPL algorithms. Currently this doesn't work.
Here's a minimal example of what I'd like to do:
[full code tarball]
Here, instead of passing an
int*
to a USM device buffer into oneDPL reduce, I'm passing in the iterator type ofstd::span
, which happens to be GCC's__normal_iterator
. Currently, this results in a seg fault, I believe because oneDPL is creating a CPU-side copy of the buffer before launching the algorithm. (And the CPU-side access of a USM device allocation causes a seg fault.) Looking through some of the oneDPL code, it seems like this is what happens with most iterators, except for raw pointers and some special iterator types.In this specific example, I could of course call
.data()
instead of.begin()
to get raw pointers, which would have the desired behavior. However, I'm interested in using more complicated device iterator types that can't be represented by raw pointers.Is there any way to have oneDPL directly launch the kernel with my iterators, instead of copying the data CPU-side?