JackAKirk commented 4 days ago

Summary

Linear algebra operators in oneMKL LAPACK that can fail computationally (for example, matrix inversion via getri, which has no solution for a singular matrix) report that failure by throwing an exception ([oneapi::mkl::lapack::computation_error](https://oneapi-spec.uxlfoundation.org/specifications/oneapi/latest/elements/onemkl/source/architecture/architecture#onemkl-lapack-exception-computation-error)). This imposes an implementation constraint: functions such as getri must be synchronous, since they generally do not know the error code until the computation completes. So even when a programmer passes a matrix that does have a valid solution for the given operation (e.g. a non-singular matrix for an inverse), all subsequent work is forced to wait for the synchronous call to return, just to check an error code that turns out to be irrelevant. This affects a large proportion (maybe most?) of oneMKL LAPACK's most computationally intensive functions. Any workload using these functions will be severely bottlenecked with respect to asynchronous performance.

However, native libraries such as cusolver (which oneMKL uses as a backend) can return this "computation error" information via a value that is returned asynchronously. A change to the oneMKL specification would therefore fix this issue.
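To make the constraint concrete, here is a minimal sketch in plain standard C++ (deliberately not oneMKL or SYCL code). `computation_error` and `invert_2x2_throwing` are hypothetical stand-ins for the oneMKL exception and for a routine such as getri, and the `info` value of 1 is an arbitrary illustrative code. The only point is that an error delivered by `throw` cannot reach the caller before the computation has finished, so the call has to block.

```cpp
#include <array>
#include <cassert>
#include <stdexcept>

using Mat2 = std::array<double, 4>;  // row-major 2x2 matrix

// Hypothetical stand-in for oneapi::mkl::lapack::computation_error.
struct computation_error : std::runtime_error {
    explicit computation_error(int info_code)
        : std::runtime_error("computation error"), info(info_code) {}
    int info;  // illustrative status code, nonzero means failure
};

// Hypothetical model of an exception-reporting inverse.
// Singularity is only known once the determinant has been computed,
// so the function cannot return before the work is done.
Mat2 invert_2x2_throwing(const Mat2& a) {
    const double det = a[0] * a[3] - a[1] * a[2];
    if (det == 0.0) {
        throw computation_error(1);
    }
    return Mat2{a[3] / det, -a[1] / det, -a[2] / det, a[0] / det};
}
```

A caller with a well-conditioned matrix still pays for the blocking call, even though the `catch` branch is never taken.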

Problem statement

Provide asynchronous oneMKL interfaces for Linear algebra operators that currently return "computation error" exceptions.

Details

oneMKL will need to remove the [oneapi::mkl::lapack::computation_error](https://oneapi-spec.uxlfoundation.org/specifications/oneapi/latest/elements/onemkl/source/architecture/architecture#onemkl-lapack-exception-computation-error) exception, and replace it with either:

1. Probably the only sensible solution: an extra parameter for each function that currently throws this exception, which instead returns a "SomethingInfo" value asynchronously. This value provides the computation error info, mapping one to one with e.g. cusolver.
2. Some kind of solution with SYCL asynchronous exceptions: I'm not sure if this is possible but could be looked into. AFAIK currently SYCL asynchronous exceptions are completely unused.
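As a rough illustration of the first option, here is a sketch in plain standard C++ under stated assumptions. `ToyQueue` is a hypothetical stand-in for a SYCL queue (it merely records work and runs it on `wait()`), and `invert_2x2_info` is a hypothetical name, not a proposed oneMKL signature. The status codes 0 and 1 are illustrative only.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Mat2 = std::array<double, 4>;  // row-major 2x2 matrix

// Toy in-order queue: submit() only records work, wait() runs it.
// Hypothetical stand-in for a sycl::queue, not a real SYCL type.
struct ToyQueue {
    std::vector<std::function<void()>> pending;
    void submit(std::function<void()> task) { pending.push_back(std::move(task)); }
    void wait() {
        for (auto& task : pending) task();
        pending.clear();
    }
};

// Hypothetical info-returning inverse: never throws for a singular input.
// Once the work has run, *info holds 0 (success) or 1 (singular).
void invert_2x2_info(ToyQueue& q, Mat2* a, int* info) {
    q.submit([a, info] {
        const Mat2 m = *a;
        const double det = m[0] * m[3] - m[1] * m[2];
        if (det == 0.0) {
            *info = 1;
            return;
        }
        *a = Mat2{m[3] / det, -m[1] / det, -m[2] / det, m[0] / det};
        *info = 0;
    });
}

// Usage: both submissions return before any status is known;
// the caller inspects the info values only after synchronizing.
int count_failures() {
    ToyQueue q;
    Mat2 good{2.0, 1.0, 1.0, 1.0};
    Mat2 bad{1.0, 2.0, 2.0, 4.0};
    int info_good = -1;
    int info_bad = -1;
    invert_2x2_info(q, &good, &info_good);
    invert_2x2_info(q, &bad, &info_bad);
    assert(info_good == -1 && info_bad == -1);  // nothing has run yet
    q.wait();
    return (info_good != 0 ? 1 : 0) + (info_bad != 0 ? 1 : 0);
}
```

The design choice being illustrated is that the error status becomes ordinary output data, so the caller decides when (or whether) to synchronize and look at it.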

ericlars commented 4 days ago

Hi @JackAKirk, thanks for the RFC.

The oneMKL LAPACK team has had an ongoing discussion on the issue you raise which I'll summarize here. We agree with your assessment of the blocking nature of using exceptions for computation errors and find it entirely reasonable to replace them with info variables (or arrays in the batch case).
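For the batch case, a per-matrix info array might look like the following plain standard C++ sketch. `invert_batch` is a hypothetical name and the 0/1 status codes are illustrative; this is not the oneMKL batch API. It shows one entry per matrix, so a single singular input is reported without hiding the results for the rest of the batch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Mat2 = std::array<double, 4>;  // row-major 2x2 matrix

// Hypothetical batched inverse: inverts each matrix in place and returns
// one status per matrix (0 = success, 1 = singular, left unchanged).
std::vector<int> invert_batch(std::vector<Mat2>& batch) {
    std::vector<int> info(batch.size(), 0);
    for (std::size_t i = 0; i < batch.size(); ++i) {
        const Mat2 m = batch[i];
        const double det = m[0] * m[3] - m[1] * m[2];
        if (det == 0.0) {
            info[i] = 1;  // record the failure and keep going
            continue;
        }
        batch[i] = Mat2{m[3] / det, -m[1] / det, -m[2] / det, m[0] / det};
    }
    return info;
}
```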

> Some kind of solution with SYCL asynchronous exceptions: I'm not sure if this is possible but could be looked into. AFAIK currently SYCL asynchronous exceptions are completely unused.

SYCL does not allow exceptions to be thrown in kernel scope. We're only aware of the possibility of throwing asynchronous exceptions from host_tasks, which limits their usefulness.

> Provide asynchronous oneMKL interfaces for Linear algebra operators that currently return "computation error" exceptions

Exception handling of computation errors is not the only blocker for asynchronous behavior. As we understand it, SYCL provides host_task for scheduling CPU tasks with device tasks. A limitation of host_task is that it is undefined behavior to capture queues or events, so even if a kernel updates an info variable it is not possible to asynchronously schedule a task conditioned on the outcome of a prior kernel within the SYCL framework.

Furthermore, several oneMKL LAPACK functions do not lend themselves to performant GPU-only implementations and so perform some critical sections on the CPU. While the GPU portions are bound to the context provided by the SYCL queue, the CPU portions generally assume they have unfettered access to CPU resources. For these routines the benefit of asynchronicity is unclear to us.

JackAKirk commented 4 days ago

Thanks for the quick reply!

> Exception handling of computation errors is not the only blocker for asynchronous behavior. As we understand it, SYCL provides host_task for scheduling CPU tasks with device tasks. A limitation of host_task is that it is undefined behavior to capture queues or events, so even if a kernel updates an info variable it is not possible to asynchronously schedule a task conditioned on the outcome of a prior kernel within the SYCL framework.

oneMKL is a library and does not have to restrict itself to the existing SYCL 2020 specification. In fact, we have already solved this issue for the two backends that it affects via the enqueue_native_command DPC++ extension: please see https://github.com/oneapi-src/oneMKL/pull/572. As I understand it, this completely resolves the issue you raise here.

> Furthermore, several oneMKL LAPACK functions do not lend themselves to performant GPU-only implementations and so perform some critical sections on the CPU. While the GPU portions are bound to the context provided by the SYCL queue, the CPU portions generally assume they have unfettered access to CPU resources. For these routines the benefit of asynchronicity is unclear to us.

Sure, I understand that certain functions (and/or certain backends) may not be able to take advantage of this. However, the cusolver and rocsolver backends have a large number of functions to which such limitations do not currently apply; it also sounds like the Intel backends have at least a few cases that could take advantage of such an improved interface? And I expect that future generations of Intel implementations will improve on the current situation?

ericlars commented 4 days ago

Glad to hear the host_task issues have been worked around, if at least for some backends. We support this change; do you plan on driving the spec update over on https://github.com/uxlfoundation/oneAPI-spec?

JackAKirk commented 2 days ago

> Glad to hear the host_task issues have been worked around, if at least for some backends. We support this change; do you plan on driving the spec update over on https://github.com/uxlfoundation/oneAPI-spec?

@Ruyk could I work on this? These linear algebra operators are used in PyTorch, and they are already hooked up to Intel Python's NumPy implementation: https://github.com/IntelPython/dpnp

oneapi-src / oneMKL

[Specification] oneMKL lapack to allow asynchronous functions #589
