Open sebastiangrimberg opened 1 year ago
You should run 1 MPI process per GPU (sounds like that is what you are trying to do). On systems like Perlmutter or Frontier etc, you can use the job scheduler to make sure that each MPI process only sees a single GPU device.
But if an MPI process sees multiple GPU devices, in STRUMPACK we do `cudaSetDevice(rank % devs);`, where `rank` is the MPI rank and `devs` is the number of GPUs, from `cudaGetDeviceCount`.
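As a minimal, compilable sketch of that round-robin mapping (the helper name is hypothetical, and the actual MPI/CUDA calls are only noted in comments so the snippet builds without those libraries):

```cpp
// Sketch of STRUMPACK's round-robin rank-to-GPU binding. In real code,
// `rank` comes from MPI_Comm_rank, `devs` from cudaGetDeviceCount, and
// the result is passed to cudaSetDevice(rank % devs).
int device_for_rank(int rank, int devs) {
  return devs > 0 ? rank % devs : 0;  // guard against zero visible devices
}
```

So with 2 ranks and 2 visible devices, rank 0 gets device 0 and rank 1 gets device 1; with more ranks than devices, the assignment wraps around.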
What exactly do you mean with "where each process is assigned in my own code to devices 0 and 1"?
> where each process is assigned in my own code to devices 0 and 1

I mean I am calling `cudaSetDevice(device_id);` in my application code, before (and after) calling STRUMPACK, since I'm doing some matrix assembly on the GPU.
I'm not sure why this isn't working, though. The `cudaSetDevice` call inside STRUMPACK is probably setting the exact same GPU device as I have already set in my code beforehand, so it shouldn't make a difference in this circumstance.
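One defensive way to structure that kind of before/after device management is an RAII guard that restores the caller's current device around any library call that may change it. This is a generic sketch, not anything from STRUMPACK: `DeviceGuard` is a hypothetical name, and the getter/setter callbacks stand in for `cudaGetDevice`/`cudaSetDevice` so the snippet builds without the CUDA runtime.

```cpp
#include <functional>

// RAII guard: remembers the "current device" on construction and restores
// it on destruction, even if the guarded library call changed it.
// With the real runtime, the constructor would call cudaGetDevice(&saved_)
// and the destructor cudaSetDevice(saved_) (error checking omitted).
class DeviceGuard {
  std::function<void(int)> set_;
  int saved_;
public:
  DeviceGuard(std::function<int()> get, std::function<void(int)> set)
      : set_(std::move(set)), saved_(get()) {}
  ~DeviceGuard() { set_(saved_); }
};
```

Usage: construct a `DeviceGuard` just before the solver call; when it goes out of scope, the application's device selection is back in effect for the subsequent assembly code.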
Could it be due to MAGMA? Can you try without MAGMA?
I think the code at this line is not executed when running without MAGMA:
CUDA assertion failed: invalid resource handle /build/extern/STRUMPACK/src/dense/CUDAWrapper.cu 112
Hm, odd. Building without MAGMA, the error message I get is:
CUDA error: (cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost)) failed with error:
--> an illegal memory access was encountered
which occurs outside of the STRUMPACK solve. Again, everything is totally fine when using SuperLU_DIST (which is also using the GPU) or a CPU-based direct solver. It seems like maybe STRUMPACK is corrupting memory somewhere?
I'll see if I can find a machine with multiple GPUs per node. Maybe I can reproduce on Perlmutter.
Awesome, thanks! For reference, I'm running on AWS, on a single p3.8xlarge EC2 instance (4 x V100 GPUs).
I can reproduce it on Perlmutter, using setup (1): https://docs.nersc.gov/systems/perlmutter/running-jobs/#1-node-4-tasks-4-gpus-all-gpus-visible-to-all-tasks With setup (2) (https://docs.nersc.gov/systems/perlmutter/running-jobs/#1-node-4-tasks-4-gpus-1-gpu-visible-to-each-task) it works fine.
The only difference between setups (1) and (2) is that for (1) STRUMPACK calls `cudaSetDevice`. I did also notice that it runs fine with setup (1) when I add `export OMP_NUM_THREADS=1`.
I will investigate further.
You say it works with SuperLU. Did you set `SUPERLU_BIND_MPI_GPU`?
Awesome, thank you for your help with this, and great that you can reproduce. No, I did not set `SUPERLU_BIND_MPI_GPU` when testing with SuperLU_DIST. That looks like it controls whether or not SuperLU will call `cudaSetDevice` during setup, so this is consistent with your finding that STRUMPACK works OK when not calling `cudaSetDevice`. It's interesting that `OMP_NUM_THREADS` also affects the result; I was running with `OMP_NUM_THREADS=1` for my tests as well. I'll spend some time looking into this too, so let me know if I can help in any way.
It works correctly with SuperLU with `SUPERLU_BIND_MPI_GPU` set.
I can't figure out what is wrong. I know that calling `cudaSetDevice` will reset the device, and then all streams etc. will be invalid. But `cudaSetDevice` is the first CUDA call in STRUMPACK.
I thought it might be due to CUDA-aware MPI, so I tried to disable that, but that doesn't make a difference.
Perhaps you can set `CUDA_VISIBLE_DEVICES`, but you need to find a way to set it to a different value for different MPI ranks.
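One way to do that from the application (a sketch, not STRUMPACK functionality; the helper name is hypothetical, and which local-rank environment variable exists depends on the launcher) is to export a per-rank `CUDA_VISIBLE_DEVICES` before `MPI_Init` and before the first CUDA runtime call, using a local-rank variable that launchers set before the application starts:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: export a per-rank CUDA_VISIBLE_DEVICES so each rank
// sees exactly one GPU (as device 0). Must run before MPI_Init and before
// any CUDA runtime call. OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI,
// SLURM_LOCALID by Slurm; other launchers use other names (assumption).
int bind_one_gpu_per_rank(int gpus_per_node) {
  const char* s = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK");
  if (!s) s = std::getenv("SLURM_LOCALID");
  int local_rank = s ? std::atoi(s) : 0;
  int dev = local_rank % gpus_per_node;
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%d", dev);
  setenv("CUDA_VISIBLE_DEVICES", buf, 1);  // rank now sees only this GPU
  return dev;
}
```

After this, any library-internal `cudaSetDevice(rank % devs)` is harmless, since `cudaGetDeviceCount` reports a single visible device per rank.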
I wonder if the issue is somehow an interplay between STRUMPACK and SLATE (I noticed SLATE also has calls to `cudaSetDevice` internally). I'm not super familiar with how STRUMPACK is using SLATE, but this is definitely a differentiator vs. SuperLU.
I also see the issue without linking with SLATE (or MAGMA).
I'm having some issues running STRUMPACK with more than a single GPU per node. By comparison, SuperLU_DIST is fine. This is with CUDA. In particular, if I run with 2 MPI processes (not CUDA aware), where each process is assigned in my own code to devices 0 and 1, I get:
Is there anything special we need to do building STRUMPACK + MPI with CUDA support?