Closed: raminammour closed this issue 3 years ago
Here is a smaller reproducer (which again works with 1 process), if it helps:

```
$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
ERROR: Out of GPU memory trying to allocate 232.742 KiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
```
Hi @raminammour,
Thanks for reporting. I just tried your MVE and it works fine for me, both with `-n 1` and `-n 2`. Before we dive further into debugging: are you sure that, when running with more than one MPI process, each MPI process has access to its own dedicated GPU (i.e., that both MPI processes are not trying to initialize on the same GPU)? This could be the reason for the error you get.
Also, what type of multi-GPU system are you running on, and how many GPUs per node?
Here is my output:

```
[lraess@node]$ $mpirun_ -n 1 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 31x31x31 (nprocs: 1, dims: 1x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
[lraess@node]$ $mpirun_ -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
0.0
```
Hey, thanks for the reply.
These are nodes with 4 GPUs; I reserve them with SLURM and just tried to be more explicit: `srun -n 4 --gres=gpu:4 --cpus-per-gpu=1 --pty bash -l`. Is there a way to make sure that each MPI process has a dedicated GPU? (It seems that the SLURM installation we have does not recognize the `ntasks-per-gpu` or `bind-gpu` options.)
You're welcome. Maybe you could try running with `-n 2` on 2 different nodes with 1 GPU per node, to see if this fixes the error and confirms there is an issue with multiple MPI processes grabbing the same GPU on multi-GPU nodes.
Another try could be to add `--ntasks-per-node=4` instead of `--cpus-per-gpu=1` (but I don't have much experience with SLURM).
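As a side note, a common generic way to enforce one GPU per rank (not something specific to the packages in this thread, just a standard approach) is to mask devices via `CUDA_VISIBLE_DEVICES` using SLURM's node-local task ID, before any CUDA initialization happens. A minimal sketch in Python, with `SLURM_LOCALID` simulated since we are not running under `srun`:

```python
import os

# Sketch: restrict each SLURM task to a single GPU before any CUDA init.
# Under srun, SLURM sets SLURM_LOCALID to the node-local task index;
# we set it by hand here purely for illustration.
os.environ["SLURM_LOCALID"] = "1"              # simulated; srun provides this
local_id = os.environ["SLURM_LOCALID"]

# With this mask in place, the process only ever sees one device,
# so two ranks on the same node cannot collide on GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = local_id
print(os.environ["CUDA_VISIBLE_DEVICES"])      # -> 1
```

The important detail is that the variable must be exported before the CUDA runtime is initialized in the process; setting it afterwards has no effect.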
> Is there a way to make sure that each MPI process has a dedicated GPU?
On the application side, it should be handled by `select_device()` from here, which gets the node-local MPI rank and assigns it as the `GPU_ID`. You could check it by printing the output as such (`@show me = select_device()`):

```
$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);@show me=select_device();@show sum(@zeros(siz,siz,siz))"
```
It seems to be correctly selecting different devices:

```
--------------------------------------------------------------------------
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
me = select_device() = 1
me = select_device() = 0
```
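For reference, the node-local-rank to GPU mapping that `select_device()` is described as performing can be sketched as follows. This is a minimal illustration in Python; the function name `assign_gpu` and the round-robin policy are assumptions for the sketch, not ImplicitGlobalGrid's actual internals:

```python
# Illustrative sketch of local-rank-based device selection: each MPI rank
# on a node gets a GPU by taking its node-local rank modulo the number of
# GPUs visible on that node.

def assign_gpu(local_rank: int, gpus_per_node: int) -> int:
    """Round-robin mapping: node-local rank r -> GPU r % gpus_per_node."""
    if gpus_per_node < 1:
        raise ValueError("need at least one GPU per node")
    return local_rank % gpus_per_node

# With 4 GPUs per node, 4 node-local ranks each get a distinct device:
print([assign_gpu(r, 4) for r in range(4)])  # -> [0, 1, 2, 3]
```

In the output above, the two ranks reporting `me = 0` and `me = 1` correspond to node-local ranks 0 and 1 each landing on a distinct GPU, which is why the mapping itself looks correct here.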
I simplified further and am getting segfaults from the CUDA and MPI interaction. I will investigate further, and come back here if the error is not coming from upstream.
For now feel free to close the issue if you see fit, or keep it open and I will notify you of any resolution.
Cheers!
Indeed, it seems to be OK regarding device selection. Could you try running on different nodes?
> I simplified further and am getting segfaults from the CUDA and MPI interaction
That's not nice. Do you use CUDA-aware MPI?
You certainly did already, but these two resources may give further hints on how to set up the environment to run multi-GPU:
Yeah, MPI reports that it `has_cuda()`, but it still segfaults :(
Could you maybe try `export IGG_CUDAAWARE_MPI=0` to disable CUDA-awareness within ImplicitGlobalGrid.jl, and see if this fixes the segfault?
Is the machine you are running on using MPICH as its MPI?
This machine is using OpenMPI.
`export IGG_CUDAAWARE_MPI=0` does not fix the OOM error above.
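For context on what this switch changes conceptually: with CUDA-aware MPI, device buffers are handed to MPI directly, while without it the halo data must be staged through host memory before sending. A rough sketch in Python, with plain lists standing in for device buffers and hypothetical function names (not ImplicitGlobalGrid's actual API):

```python
# Illustrative sketch (hypothetical names) of the two send paths that a
# CUDA-awareness switch selects between. Lists stand in for buffers.

def prepare_send_buffer(device_halo, cuda_aware: bool):
    """Return the buffer that would be handed to MPI_Send."""
    if cuda_aware:
        # CUDA-aware path: MPI reads the device buffer directly.
        return device_halo
    # Fallback path: copy device -> host first, then send the host copy.
    host_buffer = list(device_halo)
    return host_buffer

halo = [1.0, 2.0, 3.0]
direct = prepare_send_buffer(halo, cuda_aware=True)
staged = prepare_send_buffer(halo, cuda_aware=False)
print(direct is halo, staged == halo, staged is halo)  # -> True True False
```

Since the OOM above persists on both paths, the problem is likely not in the CUDA-aware transfers themselves.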
Does the MPI install use UCX instead of openib ? I had issues getting MPI to work properly if OpenMPI was configured with UCX.
I am on a managed system, so cannot play with any installations unfortunately.
If you have multiple nodes, it would be interesting to see if the error occurs when running with one GPU per node on 2 nodes.
Also, do you get any errors (segfaults) when running on 2 MPI processes with the `Threads` backend?
It works with one GPU per node and two MPI processes, but not with 2 GPUs on one node!
Interesting.
> I simplified further and am getting segfaults from the CUDA and MPI interaction. I will investigate further, and come back here if the error is not coming from upstream.
Hi @raminammour, do I understand correctly that you could reproduce the issue without ImplicitGlobalGrid and ParallelStencil? Then you would best write in the GPU section on Julia Discourse in order to get help. I would guess that something is wrong with the GPU memory allocation to your processes, independent of what code you run...
Thank you both for helping!
Yes, @omlins, I was able to get a segfault just using MPI and CUDA with a broadcast/receive pattern. The OOM I am getting above sometimes leads to a segfault, which leads me to believe that there is undefined behavior somewhere, as you say, with the allocation of memory on the GPUs.
We have factorial combinations of CUDA drivers, compilers, MPIs, and other little things that are making my life difficult in building a reproducible example. Also, the system is new, and in my experience these things get ironed out in the first few weeks (as admins correct things). When I manage to build one, I will file upstream or discuss on Discourse, as you advise.
Feel free to close here for now; we can re-open if I determine that the issue pertains to `ParallelStencil` or `ImplicitGlobalGrid`, which are great, regardless :)
OK, I will close the issue then. Don't hesitate to open a topic on Julia Discourse; it could well be that somebody will immediately know what's going on...
PS: thanks for your nice words about ParallelStencil and ImplicitGlobalGrid!
Hello,
Thank you for a great package! I am trying one of your examples and it runs fine with one MPI process (it never runs out of memory), but it fails with an OOM error with more MPI processes (I checked: the devices are less than 1% memory occupied).
With one process, success:
With 2 processes, failure:
I appreciate any hints you may have about why this is happening.
Cheers!