oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0

Persistent working buffer for scans #1510

Open al42and opened 4 months ago

al42and commented 4 months ago

Currently, SYCL scans allocate and deallocate memory each time they are called for a large enough array.

E.g., a code like this:

    sycl::queue q(dev, {sycl::property::queue::in_order()});
    const auto  policy = oneapi::dpl::execution::make_device_policy(q);

    for (int i = 0; i < 20; i++) {
        q.fill<int>(arrayIn, 0, N);
        oneapi::dpl::experimental::exclusive_scan_async(policy, arrayIn, arrayIn + N, arrayOut, 0, sycl::plus<int>{});
    }
    q.wait_and_throw();

will allocate and then free device memory on every loop iteration.

While having the working buffer allocated automatically is a huge convenience for StdPar-like usage, it is sometimes preferable to avoid overheads even at the cost of complexity.

The same problem likely applies to some other algorithms, judging by the use of sycl::buffer variables in the same file, but I only looked at scans.

Proposed solution: provide an API to query the required working-buffer size for each algorithm given the input size, and extend the API for StdPar-like operations to accept extra properties, one of which could be the working-buffer pointer / object. In the future, other properties could be added to serve as execution hints.

Or somehow have the buffer persist automatically via the device policy?

danhoeflinger commented 4 months ago

@al42and Thank you very much for the feedback. This is something we are aware of as a potentially beneficial feature for users who are interested in improved performance at the cost of some convenience. Your post very clearly lays out this demand.

My expectation is that if oneDPL were to support something like this, it would be in the context of kernel template APIs. It matches the mindset behind kernel templates, which is to give the user more control in pursuit of better performance, at the cost of some generality. We are considering this feature, among others, as we prioritize performance within that effort.

al42and commented 4 months ago

Hi @danhoeflinger,

My expectation is that if oneDPL were to support something like this, it would be in the context of kernel template APIs.

If you have a clear idea of what the API would look like, could you elaborate, please? I can't see how a runtime pointer could be passed this way. Or are you suggesting a flag along the lines of "persist the working buffer past the launch and reuse the old one if it exists"?

We are considering this feature as well as others to prioritize performance within that effort.

To be clear: this is not a priority for us (GROMACS). So far, we are operating on an array of a fixed size of 8k elements, so it's a single kernel launch without any working buffers. But that size was chosen arbitrarily, so this problem could become pressing if we ever go past 16k, which is why I decided to raise the issue proactively.

danhoeflinger commented 4 months ago

@al42and I don't have specific information at this time about what the API would look like, but this issue of temporary-memory reuse is something we are considering. The "kernel template" APIs are not tied to the C++ standard library's parallel algorithms specification, and we envision API adjustments to support functionality like this. There are multiple possible approaches to support this feature; one option is to add an extra API to query the temporary space required, plus extra runtime parameter(s) to accept externally allocated memory.

Separate from a potential oneDPL feature, there are ways to mitigate the performance penalty of these repeated temporary allocations that are available in the oneAPI DPC++ compiler today. I suggest taking a look at the SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR environment variable: https://intel.github.io/llvm-docs/EnvironmentVariables.html#debugging-variables-for-level-zero-plugin It allows configuring the memory pool sizes used by the Level Zero USM allocator. For repeated oneDPL calls of the same size, this can reduce the cost of the temporary allocation by serving it from a memory pool instead of a fresh device allocation. If you need help selecting values for this environment variable, we can work with you on your specific use case, but some experimentation may be necessary.
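For reference, the variable is set in the environment before launching the application. The value below is only an illustration (enable pooling, raise the maximum pool size); please check the linked documentation for the exact value grammar before relying on it, and note that the application name here is hypothetical:

```shell
# Illustrative only -- consult the linked Environment Variables docs
# for the exact value grammar.
# First field enables the pooling of USM allocations, second raises
# the maximum pool size so repeated same-sized temporary allocations
# can be served from the pool.
export SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR="1;32M"
./my_scan_benchmark   # hypothetical application binary
```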

When we have more information, we will update here. Thanks again for the feedback.

al42and commented 4 months ago

I suggest taking a look at the SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR environment variable: intel.github.io/llvm-docs/EnvironmentVariables.html#debugging-variables-for-level-zero-plugin

That will not work for other backends, unfortunately; and even for Level Zero it is not a user-friendly solution.

But thanks for suggesting it as a workaround; it could definitely be helpful during development.