pghysels / STRUMPACK

Structured Matrix Package (LBNL)
http://portal.nersc.gov/project/sparse/strumpack/

Is there a way to pass and receive GPU pointers to linear solvers? #113

Open · ImBlackMagic opened this issue 6 months ago

ImBlackMagic commented 6 months ago

Hello!

I'm working on a project that needs a large sparse linear system solved in each iteration of a simulation; this solve takes about 90% of each iteration's time. The matrix is about 10,000 x 10,000, with 126k nonzero entries, and it is unsymmetric.

The final objective of the project is to have everything running on the GPU (CUDA kernels for everything that isn't the linear solver). According to STRUMPACK's documentation, the solver interface takes and returns host memory pointers, so I would need to copy my data from GPU to CPU, STRUMPACK would then internally transfer it twice (CPU -> GPU to compute, and GPU -> CPU to return the data), and I would copy the solution back to the GPU afterwards.
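To make this concrete, this is roughly the round trip I expect to need right now (a sketch using the StrumpackSparseSolver C++ interface as I understand it from the documentation; the device arrays `d_b` and `d_x` are just placeholders from my simulation):

```cpp
#include <vector>
#include <cuda_runtime.h>
#include "StrumpackSparseSolver.hpp"

// Sketch: solve A x = b once per simulation step, where b and x live on
// the GPU but the solver only accepts host pointers.
void solve_step(strumpack::StrumpackSparseSolver<double,int>& solver,
                int n, const double* d_b, double* d_x) {
  std::vector<double> h_b(n), h_x(n);
  // GPU -> CPU: bring the right-hand side into host memory.
  cudaMemcpy(h_b.data(), d_b, n * sizeof(double), cudaMemcpyDeviceToHost);
  // Solve with host pointers; internally STRUMPACK moves the data to the
  // GPU for the factorization/solve and back again.
  solver.solve(h_b.data(), h_x.data());
  // CPU -> GPU: push the solution back to the device for the next kernels.
  cudaMemcpy(d_x, h_x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
}
```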

So, I have a couple of questions:

Thanks in advance!

PS: I spent several hours staring at the source code to no avail; I guess I have yet to attain higher arcane powers.

pghysels commented 6 months ago

Hi

No, at the moment all the input is from host memory.

The main code for GPU factorization is in src/sparse/fronts/FrontalMatrixGPU.cpp or src/sparse/fronts/FrontalMatrixMAGMA.cpp if MAGMA is enabled. Only the MAGMA code does the triangular solve on the GPU for now.

There are a number of steps in the code that are still done on the CPU, such as the application of the permutations (for fill reduction and static pivoting), and the scaling. This is why the input is still required on the CPU.
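For reference, a typical GPU-enabled run through the C++ interface looks roughly like this (a sketch; the option and method names are from the documentation as I recall them, so double check them against your version):

```cpp
#include "StrumpackSparseSolver.hpp"

// n: matrix size; rowptr/colind/vals: CSR arrays; b/x: host vectors.
strumpack::StrumpackSparseSolver<double,int> solver;
solver.options().enable_gpu();   // or pass --sp_enable_gpu on the command line
solver.set_csr_matrix(n, rowptr, colind, vals, /*symmetric_pattern=*/false);
solver.reorder();                // fill-reducing ordering, permutations: CPU
solver.factor();                 // numerical factorization: GPU (CUDA or MAGMA)
solver.solve(b, x);              // triangular solve: GPU only with MAGMA
```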

We will try to move everything to GPU in the future.

Pieter

ImBlackMagic commented 6 months ago

Thanks for your answer.

If I understand your reply correctly, do I also need to install MAGMA to run as many of the steps as possible on the GPU? I didn't install MAGMA in my current setup, but it is still faster than what I had running previously.

Thanks again for your answer!

pghysels commented 6 months ago

Yes, we have two implementations for the numerical factorization, one using just the CUDA libraries, and another using MAGMA. For the factorization, the MAGMA code is slightly faster than the CUDA code. The MAGMA implementation also does the triangular solve on the GPU, if the triangular factors fit on the device (and when using a single MPI rank). If you do not configure with MAGMA, the triangular solve phase is always done on the CPU.
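If you want to check what your current build supports, something like this should work (a sketch; the STRUMPACK_USE_CUDA / STRUMPACK_USE_MAGMA macro names are my assumption about the generated config header, so verify them in the StrumpackConfig.hpp installed with your build):

```cpp
#include <cstdio>
#include "StrumpackConfig.hpp"  // generated at configure/install time

int main() {
#if defined(STRUMPACK_USE_MAGMA)
  std::printf("MAGMA build: GPU factorization and GPU triangular solve\n");
#elif defined(STRUMPACK_USE_CUDA)
  std::printf("CUDA-only build: GPU factorization, CPU triangular solve\n");
#else
  std::printf("CPU-only build\n");
#endif
  return 0;
}
```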

I think in the future we will drop the non-MAGMA implementation, since we are relying more and more on MAGMA.