ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Question/Feature request: Hybrid CPU-GPU solvers, cooperating backends #462

Open klausbu opened 4 years ago

klausbu commented 4 years ago

I'd like to use GPUs occasionally as "boosters" to speed up larger, usually distributed-memory simulations that don't fit into GPU memory, on nodes with a rather high core count as they are typically used for engineering development. I want to fully exploit the fp64 hardware capabilities for cases where mixed-precision hybridization is not an option (at least not beyond preconditioning).

Let's say I have 32 cores and two GPUs. A scenario could be to split the case into 3 domains, allocating one domain to 30 cores with the OpenMP backend and each of the other two partitions to 1 core + 1 GPU with a GPU backend (CUDA or HIP as applicable).

Can the different backends cooperate?

hartwiganzt commented 4 years ago

@klausbu thank you for your question! Whether this is a reasonable setup really depends on the hardware. For example, on the Summit supercomputer, 98% of the performance is in the NVIDIA V100 GPUs and only 2% is in the IBM Power9 CPUs. In other words, if you have one or two strong GPUs, you may want to use the CPU for communication / management only. But in general, you can use domain decomposition and solve subdomains on the CPU and the GPU in parallel. Maybe @pratikvn has more experience with this.

pratikvn commented 4 years ago

@klausbu , if your subdomains have distinct problems that they need to solve and communicate between the solves, then this is possible, but you would have to manage the communication yourself, e.g. through MPI. For example, you could use a domain decomposition method to split your problem into different subdomains, solve the local problems in parallel on those subdomains, then communicate and iterate until convergence.

For your example case, you would create 3 executors, an omp executor and two cuda/hip executors, and have three MPI ranks, one associated with each executor. Then on the local executors you would be able to use all the functionality of Ginkgo (solvers/preconditioners/SpMV) independently, but between the executors you would manage the communication through MPI functions yourself.
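For concreteness, a minimal sketch of that setup, assuming Ginkgo's 1.x executor and CG solver API; the matrix/vector file names, the rank-to-device mapping, and the subdomain coupling are hypothetical and left to the user:

```cpp
// Minimal sketch: one MPI rank per subdomain, each rank owning its own Ginkgo
// executor. Assumes Ginkgo's 1.x API; the file names and the rank-to-device
// mapping below are hypothetical.
#include <ginkgo/ginkgo.hpp>
#include <mpi.h>

#include <fstream>
#include <memory>
#include <string>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Rank 0 uses the OpenMP executor (e.g. 30 cores via OMP_NUM_THREADS);
    // ranks 1 and 2 each drive one GPU.
    std::shared_ptr<gko::Executor> exec;
    if (rank == 0) {
        exec = gko::OmpExecutor::create();
    } else {
        exec = gko::CudaExecutor::create(rank - 1, gko::OmpExecutor::create());
    }

    // Each rank reads its own subdomain matrix and right-hand side.
    using mtx = gko::matrix::Csr<double>;
    using vec = gko::matrix::Dense<double>;
    std::ifstream a_stream("A_" + std::to_string(rank) + ".mtx");
    std::ifstream b_stream("b_" + std::to_string(rank) + ".mtx");
    auto A = gko::share(gko::read<mtx>(a_stream, exec));
    auto b = gko::read<vec>(b_stream, exec);
    auto x = vec::create(exec, b->get_size());
    x->fill(0.0);

    // Local fp64 CG solve on whichever executor this rank owns.
    auto solver =
        gko::solver::Cg<double>::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec),
                gko::stop::ResidualNormReduction<double>::build()
                    .with_reduction_factor(1e-10)
                    .on(exec))
            .on(exec)
            ->generate(A);
    solver->apply(gko::lend(b), gko::lend(x));

    // Any coupling between subdomains (halo/interface exchange, global
    // convergence check) has to be managed by the user, e.g. with MPI calls.

    MPI_Finalize();
}
```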

klausbu commented 4 years ago

@pratikvn, background: in supercomputing, hardware configurations are optimized either for GPU computation or for CPU computation. Engineering workstations in small businesses are usually different, using 32+ cores and 1-2 GPUs that offer compute performance which is currently not leveraged. For larger cases, unpublished hybrid CPU+GPU CG solvers show a speedup of 50%-150% depending on the CFD problem. Load balancing is achieved by adapting the domain decomposition; available GPU memory is a limiting factor.

The objective is to solve one decomposed fp64 problem, splitting the work across the CPU(s) and GPU(s). Hybrid mixed-precision fp64 (CPU) / fp32 (GPU) computation isn't a problem, and neither is pure GPU linear algebra as long as the problem fits into GPU memory. But I'd like to have a pure fp64 hybrid implementation too, and I am looking for a suitable library as a starting point so I don't have to reinvent the wheel from scratch.

pratikvn commented 4 years ago

@klausbu , if I understand it correctly, you have one fp64 problem which you want to solve by splitting the work across CPU and GPU.

What I meant previously was that if you want to do this with a domain decomposition method, it is possible: you can have distinct local solvers on the CPUs/GPUs and then communicate after the local solves. But if you want to do something like a distributed Krylov solve, then that is currently not possible within our setup, as one system solve can be handled by only one executor (CPU/GPU) at a time.
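To make the distinction concrete, here is a hedged sketch of the "communicate after the local solves" pattern, building on the per-rank objects (`A`, `b`, `x`, `solver`, `exec`, `vec`) from the sketch above; `exchange_halo` is a hypothetical placeholder for user-written MPI communication, and the sweep count and tolerance are made-up values:

```cpp
// Hypothetical outer loop of a domain decomposition iteration: each rank's
// solver stays bound to its single executor (CPU or GPU); all coupling between
// subdomains is user-managed. exchange_halo() is a placeholder for user code.
const int max_sweeps = 50;
const double tol = 1e-8;

auto one = gko::initialize<vec>({1.0}, exec);
auto neg_one = gko::initialize<vec>({-1.0}, exec);
auto res = gko::clone(gko::lend(b));
auto norm = vec::create(exec->get_master(), gko::dim<2>{1, 1});

for (int sweep = 0; sweep < max_sweeps; ++sweep) {
    // Local solve on this rank's executor only.
    solver->apply(gko::lend(b), gko::lend(x));

    // User-managed exchange of interface data between subdomains (placeholder).
    exchange_halo(x.get(), MPI_COMM_WORLD);

    // Local residual res = b - A * x, with its norm written to a host-side vector.
    res->copy_from(gko::lend(b));
    A->apply(gko::lend(neg_one), gko::lend(x), gko::lend(one), gko::lend(res));
    res->compute_norm2(gko::lend(norm));

    // Global convergence check across all subdomains.
    double local_sq = norm->at(0, 0) * norm->at(0, 0);
    double global_sq = 0.0;
    MPI_Allreduce(&local_sq, &global_sq, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (std::sqrt(global_sq) < tol) {
        break;
    }
}
```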