pghysels / STRUMPACK

Structured Matrix Package (LBNL)
http://portal.nersc.gov/project/sparse/strumpack/

Failure when using more than 1 GPU in STRUMPACK MPI #126

Open jinghu4 opened 11 hours ago

jinghu4 commented 11 hours ago

Hi, Dr. Ghysels,

I have run into some issues when using the multi-GPU feature of STRUMPACK to solve a sparse matrix. I built STRUMPACK successfully with support for SLATE and MAGMA.

  1. When I run the test cases in STRUMPACK ("make test"), sparse_mpi and reuse_structure_mpi both fail:
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 0.178864 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
CUDA assertion failed: invalid resource handle ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cu 114
[gpu01:2817703] *** Process received signal ***

However, it passes when I run with one GPU: OMP_NUM_THREADS=1 mpirun -n 1 test_structure_reuse_mpi pde900.mtx

  2. Random failures when solving a sparse matrix with STRUMPACK multi-GPU. For example, I try using 2 GPUs:
    mpirun -n 2 --mca pml ucx myApplication.exe

a) Sometimes it passes:

OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
# DenseMPI factorization complete, GPU=1, P=2, T=10: 0.170223 seconds, 0.00550864 GFLOPS, 0.0323613 GFLOP/s,  ds=203, du=0 

(Why GPU=1 here? Does it mean it only uses one GPU, but the two processes run on each of the GPUs I request?)

b) Sometimes it fails with this error message:

# multifrontal factorization:
#   - estimated memory usage (exact solver) = 23.5596 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
cuSOLVER assertion failed: 6 ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cpp 614
CUSOLVER_STATUS_EXECUTION_FAILED

Do you know what could be causing these issues, and how I should resolve them?

Best, -Jing

pghysels commented 10 hours ago

GPU=1 means that the GPU is enabled; otherwise it would be GPU=0. Sorry, that is confusing, I will fix it.

The OMP deprecation message is probably coming from the SLATE library.

I believe the invalid resource handle message appears because multiple MPI processes are using the same GPU, so more CUDA streams are being created than allowed per GPU.

pghysels commented 10 hours ago

This changes GPU=1 in the output to GPU enabled: https://github.com/pghysels/STRUMPACK/commit/115b152be9a5d0d77846e3694f699c53c93fe394

pghysels commented 9 hours ago

When you run with P MPI ranks on a machine with D GPUs, MPI rank p will use device d = p % D.
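As a minimal sketch of that mapping (illustrative only, not STRUMPACK's actual code; the real cudaSetDevice call site in CUDAWrapper.cpp is linked further down):

    // Illustrative only: how an MPI rank can pick its GPU round-robin,
    // mirroring the d = p % D mapping described above.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int p; MPI_Comm_rank(MPI_COMM_WORLD, &p);  // MPI rank p
      int D; cudaGetDeviceCount(&D);             // number of GPUs visible to this process
      int d = p % D;                             // round-robin device choice
      cudaSetDevice(d);   // subsequent CUDA calls from this rank target device d
      printf("rank %d -> device %d (of %d)\n", p, d, D);
      MPI_Finalize();
      return 0;
    }

With 2 ranks on a node where each rank sees 8 GPUs, this mapping gives rank 0 -> device 0 and rank 1 -> device 1, so only two devices should be selected by it.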

jinghu4 commented 9 hours ago

Yes. But what confuses me is that when I run

mpirun -n 2 myApplication

all 8 GPUs on the node show these two process IDs running.

Even when I use cudaSetDevice to assign rank 0 to GPU 0 and rank 1 to GPU 1, I can still see both processes running on both GPU 0 and GPU 1.


pghysels commented 9 hours ago

Hmm, I'm not sure. STRUMPACK calls cudaSetDevice, see here: https://github.com/pghysels/STRUMPACK/blob/115b152be9a5d0d77846e3694f699c53c93fe394/src/dense/CUDAWrapper.cpp#L330. This is called from the SparseSolver constructor, so perhaps it overrides what you specify. But it should not use all GPUs. Maybe SLATE is doing that?

You could try setting the CUDA_VISIBLE_DEVICES environment variable, but you need to set it differently for each MPI rank. You can do that in a small wrapper script which you then run with mpirun, as explained here: https://medium.com/@jeffrey_91423/binding-to-the-right-gpu-in-mpi-cuda-programs-263ac753d232
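For example, something along these lines (a sketch assuming Open MPI, which sets OMPI_COMM_WORLD_LOCAL_RANK for each launched process; the script name gpu_bind.sh is just illustrative):

    #!/bin/bash
    # gpu_bind.sh (hypothetical name): give each local MPI rank its own GPU
    # by restricting which device it can see before the application starts.
    export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
    exec "$@"

and then launch the application through the script:

    mpirun -n 2 --mca pml ucx ./gpu_bind.sh ./myApplication.exe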