Open jinghu4 opened 11 hours ago
The GPU =1 means that the GPU is enabled; otherwise it would be GPU =0.
Sorry that is confusing, I will fix that.
The OMP deprecation message is probably coming from the SLATE library.
I believe the invalid resource handle message appears because multiple MPI processes are using the same GPU, so that GPU ends up with more CUDA streams than are allowed per GPU.
This commit changes GPU =1 to GPU enabled:
https://github.com/pghysels/STRUMPACK/commit/115b152be9a5d0d77846e3694f699c53c93fe394
When you run with P MPI ranks on a machine with D GPUs, MPI rank p will use device d = p % D.
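As a quick illustration of the d = p % D mapping, here is a minimal sketch that prints the device each rank would be assigned. The helper name device_for_rank and the values P=4 and D=2 are made up for this example; they are not part of STRUMPACK.

```shell
#!/usr/bin/env bash
# Sketch of the rank-to-device mapping d = p % D.
# device_for_rank is a hypothetical helper, not a STRUMPACK function.
device_for_rank() {
  local p=$1 D=$2
  echo $(( p % D ))
}

P=4  # number of MPI ranks (example value)
D=2  # number of GPUs on the machine (example value)
for (( p = 0; p < P; p++ )); do
  echo "rank $p -> device $(device_for_rank "$p" "$D")"
done
```

With these example values, ranks 0 and 2 land on device 0 and ranks 1 and 3 land on device 1, so whenever P is larger than D several ranks share one GPU.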
Yes. But what confuses me is that when I run
mpirun -n 2 myApplication
all 8 GPUs on the node show these two process IDs running.
Even when I use cudaSetDevice to assign rank 0 to GPU 0 and rank 1 to GPU 1, I can still see both processes running on GPU 0 and on GPU 1.
Hmm, I'm not sure. STRUMPACK calls cudaSetDevice (see https://github.com/pghysels/STRUMPACK/blob/115b152be9a5d0d77846e3694f699c53c93fe394/src/dense/CUDAWrapper.cpp#L330), and this is called from the SparseSolver constructor. So perhaps that overrides what you specify. But it should not use all GPUs. Maybe SLATE is doing that?
You could try to set the CUDA_VISIBLE_DEVICES environment variable. But you need to set it differently for each MPI rank.
You can do that by setting it in a small script, which you then run using mpirun, as explained here:
https://medium.com/@jeffrey_91423/binding-to-the-right-gpu-in-mpi-cuda-programs-263ac753d232
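A minimal sketch of such a wrapper script could look like the following. It is not taken from the linked article or from STRUMPACK: the script name bind_gpu.sh, the helper pick_visible_device, and the NUM_GPUS variable (defaulting to 8, to match the node described in this thread) are assumptions for illustration. The node-local rank is read from OMPI_COMM_WORLD_LOCAL_RANK (Open MPI), MV2_COMM_WORLD_LOCAL_RANK (MVAPICH2), or SLURM_LOCALID (Slurm), whichever is set.

```shell
#!/usr/bin/env bash
# bind_gpu.sh (hypothetical name): expose one GPU to each MPI rank on a node.

# Map a node-local rank to a device index, wrapping around when there
# are more ranks than GPUs. Hypothetical helper for this sketch.
pick_visible_device() {
  local local_rank=$1 num_gpus=$2
  echo $(( local_rank % num_gpus ))
}

# Node-local rank as exported by the launcher; falls back to 0 if none is set.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${MV2_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}}

# Assumption: the caller may set NUM_GPUS; 8 is only a default for this example.
NUM_GPUS=${NUM_GPUS:-8}

# Only launch something if a command was passed on the command line.
if [ "$#" -gt 0 ]; then
  CUDA_VISIBLE_DEVICES=$(pick_visible_device "$LOCAL_RANK" "$NUM_GPUS")
  export CUDA_VISIBLE_DEVICES
  exec "$@"
fi
```

It would then be launched as mpirun -n 2 ./bind_gpu.sh myApplication. Because each process then sees only a single device, which CUDA renumbers as device 0, any cudaSetDevice call inside the application or its libraries should end up on the GPU selected by the script.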
Hi, Dr. Ghysels,
I have seen some issues when using the multi-GPU feature of STRUMPACK to solve a sparse matrix. I built STRUMPACK successfully with support for SLATE and MAGMA.
It passes when I run with one GPU:
OMP_NUM_THREADS=1 mpirun -n 1 test_structure_reuse_mpi pde900.mtx
a) sometimes it passes
(Why is GPU =1 here? Does it mean that it only uses one GPU, but two processes are run on each of the GPUs I request?)
b) sometimes it fails with error msg
Do you know what could be causing these issues and how I should resolve them?
Best, -Jing