Open sebastiangrimberg opened 1 year ago
You should run 1 MPI process per GPU (sounds like that is what you are trying to do). On systems like Perlmutter or Frontier etc, you can use the job scheduler to make sure that each MPI process only sees a single GPU device.
But if an MPI process sees multiple GPU devices, in STRUMPACK we do `cudaSetDevice(rank % devs);`, where `rank` is the MPI rank and `devs` is the number of GPUs, from `cudaGetDeviceCount`.
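As a minimal, compilable sketch of that round-robin mapping (the helper name is hypothetical, and the actual MPI/CUDA calls are only noted in comments so the snippet builds without those libraries):

```cpp
// Sketch of STRUMPACK's round-robin rank-to-GPU binding. In real code,
// `rank` comes from MPI_Comm_rank, `devs` from cudaGetDeviceCount, and
// the result is passed to cudaSetDevice(rank % devs).
int device_for_rank(int rank, int devs) {
  return devs > 0 ? rank % devs : 0;  // guard against zero visible devices
}
```

So with 2 ranks and 2 visible devices, rank 0 gets device 0 and rank 1 gets device 1; with more ranks than devices, the assignment wraps around.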
What exactly do you mean with "where each process is assigned in my own code to devices 0 and 1"?
> where each process is assigned in my own code to devices 0 and 1

I mean I am calling `cudaSetDevice(device_id);` in my application code, before (and after) calling STRUMPACK, since I'm doing some matrix assembly on the GPU.
I'm not sure why this isn't working, though. The `cudaSetDevice` call inside STRUMPACK is probably setting the exact same GPU device as I have already set in my code beforehand, so it shouldn't make a difference in this circumstance.
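One defensive way to structure that kind of before/after device management is an RAII guard that restores the caller's current device around any library call that may change it. This is a generic sketch, not anything from STRUMPACK: `DeviceGuard` is a hypothetical name, and the getter/setter callbacks stand in for `cudaGetDevice`/`cudaSetDevice` so the snippet builds without the CUDA runtime.

```cpp
#include <functional>

// RAII guard: remembers the "current device" on construction and restores
// it on destruction, even if the guarded library call changed it.
// With the real runtime, the constructor would call cudaGetDevice(&saved_)
// and the destructor cudaSetDevice(saved_) (error checking omitted).
class DeviceGuard {
  std::function<void(int)> set_;
  int saved_;
public:
  DeviceGuard(std::function<int()> get, std::function<void(int)> set)
      : set_(std::move(set)), saved_(get()) {}
  ~DeviceGuard() { set_(saved_); }
};
```

Usage: construct a `DeviceGuard` just before the solver call; when it goes out of scope, the application's device selection is back in effect for the subsequent assembly code.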
Could it be due to MAGMA? Can you try without MAGMA?
I think the code at this line is not executed when running without MAGMA:
CUDA assertion failed: invalid resource handle /build/extern/STRUMPACK/src/dense/CUDAWrapper.cu 112
Hm, odd. Building without MAGMA, the error message I get is:
CUDA error: (cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost)) failed with error:
--> an illegal memory access was encountered
which occurs outside of the STRUMPACK solve. Again, everything is totally fine when using SuperLU_DIST (which is also using the GPU) or a CPU-based direct solver. It seems like maybe STRUMPACK is corrupting memory somewhere?
I'll see if I can find a machine with multiple GPUs per node. Maybe I can reproduce on Perlmutter.
Awesome, thanks! For reference, I'm running on AWS, on a single p3.8xlarge EC2 instance (4 x V100 GPUs).
I can reproduce it on Perlmutter, using setup (1): https://docs.nersc.gov/systems/perlmutter/running-jobs/#1-node-4-tasks-4-gpus-all-gpus-visible-to-all-tasks With setup (2) (https://docs.nersc.gov/systems/perlmutter/running-jobs/#1-node-4-tasks-4-gpus-1-gpu-visible-to-each-task) it works fine.
The only difference between setups (1) and (2) is that for (1) STRUMPACK calls `cudaSetDevice`. I did also notice that it runs fine with setup (1) when I add `export OMP_NUM_THREADS=1`.
I will investigate further.
You say it works with SuperLU. Did you set `SUPERLU_BIND_MPI_GPU`?
Awesome, thank you for your help with this, and great that you can reproduce. No, I did not set `SUPERLU_BIND_MPI_GPU` when testing with SuperLU_DIST. That looks like it controls whether or not SuperLU will call `cudaSetDevice` during setup, so this is consistent with your finding that STRUMPACK works OK when not calling `cudaSetDevice`. It's interesting that `OMP_NUM_THREADS` also affects the result; I was running with `OMP_NUM_THREADS=1` for my tests as well. I'll spend some time looking into this too, so let me know if I can help in any way.
It works correctly with SuperLU with `SUPERLU_BIND_MPI_GPU` set.
I can't figure out what is wrong. I know that calling `cudaSetDevice` will reset the device, and then all streams etc. will be invalid. But `cudaSetDevice` is the first CUDA call in STRUMPACK.
I thought it might be due to CUDA-aware MPI, so I tried to disable that, but that doesn't make a difference.
Perhaps you can set `CUDA_VISIBLE_DEVICES`, but you need to find a way to set it to a different value for different MPI ranks.
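One way to do that from the application (a sketch, not STRUMPACK functionality; the helper name is hypothetical, and which local-rank environment variable exists depends on the launcher) is to export a per-rank `CUDA_VISIBLE_DEVICES` before `MPI_Init` and before the first CUDA runtime call, using a local-rank variable that launchers set before the application starts:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: export a per-rank CUDA_VISIBLE_DEVICES so each rank
// sees exactly one GPU (as device 0). Must run before MPI_Init and before
// any CUDA runtime call. OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI,
// SLURM_LOCALID by Slurm; other launchers use other names (assumption).
int bind_one_gpu_per_rank(int gpus_per_node) {
  const char* s = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK");
  if (!s) s = std::getenv("SLURM_LOCALID");
  int local_rank = s ? std::atoi(s) : 0;
  int dev = local_rank % gpus_per_node;
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%d", dev);
  setenv("CUDA_VISIBLE_DEVICES", buf, 1);  // rank now sees only this GPU
  return dev;
}
```

After this, any library-internal `cudaSetDevice(rank % devs)` is harmless, since `cudaGetDeviceCount` reports a single visible device per rank.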
I wonder if the issue is somehow an interplay between STRUMPACK and SLATE (I noticed SLATE also has calls to `cudaSetDevice` internally). I'm not super familiar with how STRUMPACK is using SLATE, but this is definitely a differentiator vs. SuperLU.
I also see the issue without linking with SLATE (or MAGMA).
I'm having some issues running STRUMPACK with more than a single GPU per node. By comparison, SuperLU_DIST is fine. This is with CUDA. In particular, if I run with 2 MPI processes (not CUDA aware), where each process is assigned in my own code to devices 0 and 1, I get:
Is there anything special we need to do building STRUMPACK + MPI with CUDA support?