ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Clean separation between functionality of core and device. #1240

Open pratikvn opened 1 year ago

pratikvn commented 1 year ago

This issue refers to the discussion about having sequential operations run on the host rather than in the device kernels (reference, OpenMP, CUDA, etc.).

I would propose having a clean separation between memory allocations/deallocations and any operations that manipulate data. IMO, Ginkgo's philosophy has been to have the core orchestrate and dispatch the kernels and allocate and manage memory, but not perform any operations itself. This clean separation (see the sketch after this list)

  1. Lets us extend easily to multiple backends and add new algorithms by touching only the kernels, rather than modifying core.
  2. Yields a simpler logger and profiler interface and output, with operations and allocations clearly marked and distinguishable.
  3. Makes extension to task-based approaches easier, due to the distinct nature of the device kernels.
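
To make the proposed split concrete, here is a minimal sketch of the division of responsibilities, with hypothetical names (this is not Ginkgo's actual kernel registration machinery): core owns allocation and dispatch, while the backend kernel only manipulates data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not Ginkgo's actual API: the kernel side performs
// pure data manipulation and never allocates; one such implementation
// would exist per backend (reference, omp, cuda, ...).
namespace kernels {
void scale(double alpha, double* values, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) {
        values[i] *= alpha;
    }
}
}  // namespace kernels

// The core side owns memory management and orchestration, but never
// touches the data itself.
namespace core {
class Vector {
public:
    explicit Vector(std::size_t size) : values_(size) {}  // allocation in core

    void scale(double alpha)
    {
        // core only dispatches; the kernel does the actual work
        kernels::scale(alpha, values_.data(), values_.size());
    }

private:
    std::vector<double> values_;
};
}  // namespace core
```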

I understand that this is a bit challenging, because for many algorithms (SpGEMM, SpGEAM and the factorizations), in develop (and also in release) we currently combine allocations and operations: separating them is more difficult in those cases, especially where the algorithm is inherently sequential.
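
As an illustration of why the interleaving arises, consider a minimal single-threaded CSR SpGEMM sketch (hypothetical code, not Ginkgo's implementation): the size of the output is only known after a symbolic counting pass, so an allocation is naturally sandwiched between two kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not Ginkgo code: SpGEMM (C = A * B) over CSR.
struct Csr {
    int num_rows;
    int num_cols;
    std::vector<int> row_ptrs;   // size num_rows + 1
    std::vector<int> col_idxs;   // size nnz
    std::vector<double> values;  // size nnz
};

Csr spgemm(const Csr& a, const Csr& b)
{
    Csr c{a.num_rows, b.num_cols,
          std::vector<int>(a.num_rows + 1, 0), {}, {}};
    std::vector<char> seen(b.num_cols, 0);
    // "kernel" 1: symbolic pass, counting the nonzeros of every output row
    for (int row = 0; row < a.num_rows; ++row) {
        std::fill(seen.begin(), seen.end(), 0);
        for (int k = a.row_ptrs[row]; k < a.row_ptrs[row + 1]; ++k) {
            const auto mid = a.col_idxs[k];
            for (int j = b.row_ptrs[mid]; j < b.row_ptrs[mid + 1]; ++j) {
                if (!seen[b.col_idxs[j]]) {
                    seen[b.col_idxs[j]] = 1;
                    ++c.row_ptrs[row + 1];
                }
            }
        }
    }
    // prefix sum: per-row counts become offsets; the last entry is the nnz
    for (int row = 0; row < a.num_rows; ++row) {
        c.row_ptrs[row + 1] += c.row_ptrs[row];
    }
    // the allocation sits in the middle: its size depends on kernel 1
    c.col_idxs.resize(c.row_ptrs[a.num_rows]);
    c.values.resize(c.row_ptrs[a.num_rows]);
    // "kernel" 2: numeric pass, filling the freshly allocated arrays
    // (columns within a row end up in insertion order, not sorted)
    std::vector<double> accum(b.num_cols, 0.0);
    for (int row = 0; row < a.num_rows; ++row) {
        std::fill(seen.begin(), seen.end(), 0);
        auto out = c.row_ptrs[row];
        for (int k = a.row_ptrs[row]; k < a.row_ptrs[row + 1]; ++k) {
            const auto mid = a.col_idxs[k];
            for (int j = b.row_ptrs[mid]; j < b.row_ptrs[mid + 1]; ++j) {
                const auto col = b.col_idxs[j];
                if (!seen[col]) {
                    seen[col] = 1;
                    c.col_idxs[out++] = col;
                }
                accum[col] += a.values[k] * b.values[j];
            }
        }
        for (int k = c.row_ptrs[row]; k < c.row_ptrs[row + 1]; ++k) {
            c.values[k] = accum[c.col_idxs[k]];
            accum[c.col_idxs[k]] = 0.0;
        }
    }
    return c;
}
```

A fully separated design would have to schedule the counting kernel, the allocation, and the fill kernel as distinct steps with a data dependency between them.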

upsj commented 1 year ago

Thanks for kicking off this discussion. I think our separation into core and device is a strong suit of our project, and one we should make sure to keep up. There are even a few cases (e.g. CsrBuilder) where we break this separation, which in hindsight are not that well justified. I have some ideas on how to improve this, but they require a bit more work.

I wanted to address a few of the individual points though, since I believe they may be a bit inaccurate in places:

  1. I think forcing allocations to happen only on the core side would blow up the complexity of core to an unjustified degree. As an example, look at matrix::Fbcsr::read or distributed::Matrix::read: there we have a large amount of intermediate data of a priori unknown size, which would require a lot of individual kernels and make the control flow really hard to follow or debug, especially since every kernel call goes through one level of macros and two levels of runtime dispatch (a simplified sketch of this dispatch pattern follows after this list).
  2. I can't really follow how this decision would impact profiling loggers. A profiler logger basically just annotates the execution timeline with events; it doesn't matter where they happen. Nsight Systems and rocprof already resolve kernel launches inside the ranges, so we don't need to differentiate between host and device kernels in our reporting.
  3. It is not really clear to me how host operations would make task-based execution any easier or harder. The more complicated problems of task dependencies and required data movement cannot be answered from the function signatures alone, so they will require significant changes on the host side (maybe even more significant than on the device side) as well.
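
For readers unfamiliar with the dispatch layers mentioned in point 1, here is a simplified, self-contained sketch of the double-dispatch pattern (an Operation visiting an Executor); the class and kernel names are invented for illustration and only loosely mirror what GKO_REGISTER_OPERATION generates.

```cpp
#include <iostream>
#include <memory>

// Simplified illustration, not Ginkgo's actual Executor/Operation classes.
struct ReferenceExecutor;
struct CudaExecutor;

struct Operation {
    virtual ~Operation() = default;
    virtual void run(const ReferenceExecutor&) const = 0;
    virtual void run(const CudaExecutor&) const = 0;
};

struct Executor {
    virtual ~Executor() = default;
    // first runtime dispatch: on the dynamic type of the executor
    virtual void run(const Operation& op) const = 0;
};

struct ReferenceExecutor : Executor {
    void run(const Operation& op) const override { op.run(*this); }
};

struct CudaExecutor : Executor {
    // second runtime dispatch: on the dynamic type of the operation
    void run(const Operation& op) const override { op.run(*this); }
};

// In Ginkgo, a macro generates a class playing this role, binding each
// overload to the matching backend kernel; here the "kernels" just print.
struct ScaleOperation : Operation {
    double alpha;
    explicit ScaleOperation(double a) : alpha(a) {}
    void run(const ReferenceExecutor&) const override
    {
        std::cout << "reference::scale(" << alpha << ")\n";
    }
    void run(const CudaExecutor&) const override
    {
        std::cout << "cuda::scale(" << alpha << ")\n";
    }
};

int main()
{
    std::unique_ptr<Executor> exec = std::make_unique<CudaExecutor>();
    exec->run(ScaleOperation{2.0});  // macro level + two virtual dispatches
}
```

The macro level hides the boilerplate of writing one such Operation class per kernel; the two virtual calls are the "two levels of runtime dispatch" referred to above.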

As an alternative description of the separation between core and kernels, I would propose the IMO much more precise distinction: does a routine have a single implementation, or does it have multiple implementations? There is no need to talk about abstract notions of high-level and low-level if we already have a concrete technical justification for why we need this complex separation into different libraries, and for where it is more of an obstacle.
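
Applied to code, this criterion might look as follows (hypothetical names, invented for this sketch): a routine that genuinely differs per backend belongs in the kernel libraries, while a routine with one shared implementation stays in core.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of the "single vs. multiple implementations"
// criterion; the names are invented for this sketch.

// Multiple implementations, one per backend -> belongs in the kernel
// libraries (only the declarations are shown here).
namespace reference {
void axpy(double alpha, const double* x, double* y, std::size_t n);
}
namespace cuda {
void axpy(double alpha, const double* x, double* y, std::size_t n);
}

// A single implementation shared by all backends -> belongs in core;
// duplicating it per backend would buy nothing.
std::size_t nnz_from_row_ptrs(const std::vector<int>& row_ptrs)
{
    return row_ptrs.empty() ? 0 : static_cast<std::size_t>(row_ptrs.back());
}
```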