oneapi-src / oneDAL

oneAPI Data Analytics Library (oneDAL)
https://software.intel.com/en-us/oneapi/onedal
Apache License 2.0

ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976

Closed · david-cortes-intel closed this 3 days ago

david-cortes-intel commented 1 week ago

Description

As a follow-up to https://github.com/oneapi-src/oneDAL/pull/2930:

This PR adds a GPU version of the fallback for linear regression based on spectral decomposition, mirroring the CPU version. It reuses the same tests, but this time also executes them with fp32.
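
For context, a rough sketch of the idea behind the fallback (my own summary of the approach, not text from #2930): when the normal-equations matrix is singular or badly conditioned, so that the usual Cholesky route fails, the coefficients can still be obtained from an eigendecomposition of X^T X:

    X^T X = Q diag(l_1, ..., l_p) Q^T
    w = Q diag(s_1, ..., s_p) Q^T X^T y,    where s_i = 1/l_i if l_i is above a numerical threshold, else 0

Eigenvalues below the threshold are dropped rather than inverted, which is what makes the method robust to singular systems.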

A few things I wasn't sure about are raised as questions in the comments below.


The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are ticked. This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way. For example, a PR with a docs update doesn't require performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

Testing

Performance

Not applicable (addition is a fallback for cases that would have failed previously).

david-cortes-intel commented 1 week ago

/intelci: run

david-cortes-intel commented 1 week ago

Oddly, the first version of this function (minimum of the diagonal) that I wrote seems to succeed in some tests but fail in others (particularly when it is called repeatedly, as in incremental linear regression):

template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    // Device-side scalar that receives the reduction result
    auto diag_min_holder = array<Float>::empty(queue, 1, alloc);
    sycl::event diag_min_event = queue.submit([&](sycl::handler& h) {
        auto min_reduction = sycl::reduction(diag_min_holder.get_mutable_data(), sycl::minimum<>());
        h.depends_on(event_Matrix);
        // Reduce over the diagonal entries Matrix[i * (dim_matrix + 1)]
        h.parallel_for(dim_matrix, min_reduction, [=](const auto& i, auto& min_obj) {
            min_obj.combine(Matrix[i * (dim_matrix + 1)]);
        });
    });
    // Copy the scalar result back to the host
    return ndview<Float, 1>::wrap(diag_min_holder).at_device(queue, 0, { diag_min_event });
}

(I also tried adding event_Matrix to the dependencies of the return line, but it made no difference.)

But this slower version doesn't seem to have any issue:

template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto idx_min_holder = array<std::int64_t>::empty(queue, 1, alloc);
    // iamin returns the index of the element with the smallest absolute
    // value; the stride of dim_matrix + 1 makes it walk the diagonal
    sycl::event diag_min_event = mkl::blas::column_major::iamin(
        queue,
        dim_matrix,
        Matrix,
        dim_matrix + 1,
        idx_min_holder.get_mutable_data(),
        mkl::index_base::zero,
        { event_Matrix }
    );
    // Read the index back, then fetch the corresponding diagonal element
    const std::int64_t idx_min = ndview<std::int64_t, 1>::wrap(idx_min_holder).at_device(queue, 0, { diag_min_event });
    return ndview<Float, 1>::wrap(Matrix, dim_matrix * dim_matrix).at_device(queue, idx_min * (dim_matrix + 1), { event_Matrix });
}

Any hints on what might be happening?

david-cortes-intel commented 1 week ago

/intelci: run

david-cortes-intel commented 1 week ago

Another thing that I'm now not very sure about: should calls to oneMKL LAPACK be followed by a manual queue.wait_and_throw() in order to catch runtime exceptions from them?

This is what the oneMKL docs mention:

> oneMKL error handling relies on the mechanism of C++ exceptions. Should error occur, it will be propagated at the point of a function call where it is caught using standard C++ error handling mechanism.

... which sounds like it would wait until the exception can be thrown, but that leaves me wondering how it handles queueing.

david-cortes-intel commented 1 week ago

After some investigation into the diagonal_minimum function, it turns out the issue was that it wasn't initializing the reduction variable, despite what the docs mention about known identities: https://github.khronos.org/SYCL_Reference/iface/reduction-variables.html#known-identities

Adding a manual initialization did the trick.
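
For reference, this is roughly what the fix looks like (a sketch adapted from the snippet above, assuming the same oneDAL helpers; the actual patch may differ):

template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto diag_min_holder = array<Float>::empty(queue, 1, alloc);
    // Manually initialize the reduction variable with the identity of
    // sycl::minimum, since a plain sycl::reduction over a USM pointer
    // combines with whatever the memory already contains.
    sycl::event fill_event = queue.fill(diag_min_holder.get_mutable_data(),
                                        std::numeric_limits<Float>::max(), 1);
    sycl::event diag_min_event = queue.submit([&](sycl::handler& h) {
        h.depends_on({ event_Matrix, fill_event });
        // Alternative: skip the fill and construct the reduction with
        // sycl::property::reduction::initialize_to_identity{} instead.
        auto min_reduction = sycl::reduction(diag_min_holder.get_mutable_data(), sycl::minimum<>());
        h.parallel_for(sycl::range<1>(dim_matrix), min_reduction,
                       [=](sycl::id<1> i, auto& min_obj) {
            min_obj.combine(Matrix[i * (dim_matrix + 1)]);
        });
    });
    return ndview<Float, 1>::wrap(diag_min_holder).at_device(queue, 0, { diag_min_event });
}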

david-cortes-intel commented 1 week ago

/intelci: run

david-cortes-intel commented 1 week ago

/intelci: run

david-cortes-intel commented 1 week ago

/intelci: run

Vika-F commented 6 days ago

> Another thing that I'm now not very sure about: should calls to oneMKL LAPACK be followed by a manual queue.wait_and_throw() in order to catch runtime exceptions from them?

@david-cortes-intel Good point. I think yes, .wait_and_throw() is needed after the last oneMKL function call inside the try block to catch the exception. Otherwise it can be missed, as oneMKL calls run asynchronously on the device.
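
For illustration, the pattern would look roughly like this (a sketch with hypothetical names: potrf stands in for whichever LAPACK routine is called, and handle_lapack_error is a placeholder, not the exact code from this PR):

try {
    // The oneMKL call only enqueues work; any failure (for example a
    // non-positive-definite matrix in potrf) is raised asynchronously.
    mkl::lapack::potrf(queue, mkl::uplo::upper, n, xtx, lda,
                       scratchpad, scratchpad_size, { deps });
    // Block here so that the asynchronous exception, if any, is thrown
    // inside the try block rather than escaping it.
    queue.wait_and_throw();
}
catch (const mkl::lapack::exception& ex) {
    // info() distinguishes invalid arguments from numerical failures.
    handle_lapack_error(ex.info());
}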

david-cortes-intel commented 6 days ago

/intelci: run

david-cortes-intel commented 6 days ago

> > Another thing that I'm now not very sure about: should calls to oneMKL LAPACK be followed by a manual queue.wait_and_throw() in order to catch runtime exceptions from them?

> @david-cortes-intel Good point. I think yes, .wait_and_throw() is needed after the last oneMKL function call inside the try block to catch the exception. Otherwise it can be missed, as oneMKL calls run asynchronously on the device.

Thanks for confirming. Added a manual call to wait_and_throw.

Vika-F commented 4 days ago

/intelci: run

david-cortes-intel commented 3 days ago

/intelci: run

david-cortes-intel commented 3 days ago

/intelci: run

david-cortes-intel commented 3 days ago

CI is passing now, will merge.