[Open] pratikvn opened this issue 1 year ago
Thanks for kicking off this discussion. I think our separation into core and device code is a strong suit of our project that we should make sure to maintain. There are even a few cases (e.g. CsrBuilder) where we break this separation, which in hindsight are not that well justified. I have some ideas on how to improve this, but that requires a bit more work.
I wanted to address a few of the individual points though, since I believe they may be a bit inaccurate in places:
core

The claim that core only orchestrates and never performs operations idealizes the current state to an unjustified degree. For example, look at matrix::Fbcsr::read or distributed::Matrix::read: there we have a large amount of intermediate data of a-priori unknown size that would require a lot of individual kernels, making the control flow really hard to follow or debug (especially since every kernel call goes through one level of macros and two levels of runtime dispatch).

As an alternative description of the separation between core and kernels, I would propose the IMO much more precise distinction: does an operation have only a single implementation, or does it have multiple (per-backend) implementations? There is no need to talk about abstract concepts of what is high-level and low-level if we already have a suitable technical justification for why we need this complex separation into different libraries, and for where it is more of an obstacle.
This issue is in reference to the discussion about having sequential operations run on the host rather than in the backend kernels (reference, omp, cuda, etc.).
I would propose having a clean separation between memory allocations/deallocations and any operations that perform data manipulation. IMO, Ginkgo's philosophy has been to have the core orchestrate and dispatch the kernels and allocate and manage memory, but not perform any operations itself, keeping this separation clean.
I understand that this is a bit challenging: for many algorithms (SpGEMM, SpGEAM, and factorizations), in develop (and also in release) we currently combine allocations and operations, because separating them is more difficult in those cases, especially where the algorithm is inherently sequential.