oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0
724 stars 113 forks source link

Cannot use device iterators in oneDPL algorithms #855

Open BenBrock opened 1 year ago

BenBrock commented 1 year ago

I'd like to pass device iterators---by which I mean random access iterators that work in device kernels---into oneDPL algorithms. Currently this doesn't work.

Here's a minimal example of what I'd like to do:

// Allocate USM device buffer
int* x_d = sycl::malloc_device<int>(100, q);

// Fill buffer pointed to by `x_d` with data.
. . .

// Create span from buffer
std::span<int> x(x_d, 100);

// Pass iterators from `std::span` into oneDPL reduce.
auto sum = oneapi::dpl::reduce(oneapi::dpl::execution::make_device_policy(q),
                               x.begin(), x.end(), 0, std::plus());

[full code tarball]

Here, instead of passing an int* to a USM device buffer into oneDPL reduce, I'm passing in the iterator type of std::span, which happens to be GCC's __normal_iterator. Currently, this results in a seg fault, I believe because oneDPL is creating a CPU-side copy of the buffer before launching the algorithm. (And the CPU-side access of a USM device allocation causes a seg fault.) Looking through some of the oneDPL code, it seems like this is what happens with most iterators, except for raw pointers and some special iterator types.

In this specific example, I could of course call .data() instead of .begin() to get raw pointers, which would have the desired behavior. However, I'm interested in using more complicated device iterator types that can't be represented by raw pointers.

Is there any way to have oneDPL directly launch the kernel with my iterators, instead of copying the data CPU-side?

MikeDvorskiy commented 1 year ago

Talking about any containers based on USM - oneDPL supports just std::vector with USM allocator.

https://oneapi-src.github.io/oneDPL/parallel_api/pass_data_algorithms.html

Pass Data to Algorithms
You can use one of the following ways to pass data to an algorithm executed with a device policy:
- oneapi:dpl::begin and oneapi::dpl::end functions
- Unified shared memory (USM) pointers and std::vector with USM allocators
- Iterators of host-side std::vector
MikeDvorskiy commented 1 year ago

Is there any way to have oneDPL directly launch the kernel with my iterators, instead of copying the data CPU-side?

Currently, just USM pointers, std::vector<..., USM_allocator>::begin or begin/end over a sycl::buffer.

BenBrock commented 1 year ago

Thanks to @MikeDvorskiy for pointing out oneapi::dpl::__internal::is_passed_directly, which identifies whether an iterator can be directly passed into a SYCL kernel when a oneDPL algorithm is executed.

My proposal to address this issue, along with #854, is to introduce a direct_iterator that wraps device iterators before they are passed into oneDPL. The helper function make_direct_iterator() returns one of these direct_iterators, and achieves the desired behavior when used with oneDPL algorithms like the example above.

// Allocate USM device buffer
int* x_d = sycl::malloc_device<int>(100, q);

// Fill buffer pointed to by `x_d` with data.
. . .

// Create span from buffer
std::span<int> x(x_d, 100);

// Pass iterators from `std::span` into oneDPL reduce.
auto sum = oneapi::dpl::reduce(oneapi::dpl::execution::make_device_policy(q),
                               oneapi::dpl::make_direct_iterator(x.begin()),
                               oneapi::dpl::make_direct_iterator(x.end()),
                               0, std::plus());

I have implemented something similar internally in our distributed ranges codebase, but I've also written a quick draft of this in a PR.