omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

OOM error when running with more than one mpi rank #27

Closed: raminammour closed 3 years ago

raminammour commented 3 years ago

Hello,

Thank you for a great package! I am trying one of your examples: it runs fine with one MPI process (and never runs out of memory), but it fails with an OOM error with more MPI processes (I checked: the devices are at less than 1% memory occupancy).

With one process, success:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 1 julia --project=@. ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl
Global grid: 127x127x127 (nprocs: 1, dims: 1x1x1)
Animation directory: ./viz3D_out/
Total steps=1000, time=1.636e+01 sec (@ T_eff = 7.90 GB/s)

With 2 processes, failure:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl
Global grid: 252x127x127 (nprocs: 2, dims: 2x1x1)
ERROR: LoadError: Out of GPU memory trying to allocate 15.628 MiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
Stacktrace:
 [1] #alloc#202
   @ ./CUDA/mVgLI/src/pool.jl:266 [inlined]
 [2] alloc
   @ ./CUDA/mVgLI/src/pool.jl:258 [inlined]
 [3] CUDA.CuArray{Float64, 3}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64, Int64})
   @ ./CUDA/mVgLI/src/array.jl:28
 [4] CuArray
   @ ./CUDA/mVgLI/src/array.jl:109 [inlined]
 [5] CuArray
   @ ./CUDA/mVgLI/src/array.jl:110 [inlined]
 [6] zeros
   @ ./CUDA/mVgLI/src/array.jl:409 [inlined]
 [7] acoustic3D()
   @ Main ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl:40
 [8] top-level scope
   @ ./ParallelStencil/miwpY/miniapps/acoustic_waves_multixpu/acoustic3D_multixpu.jl:86

I appreciate any hints you may have about why this is happening.

Cheers!

raminammour commented 3 years ago

Here is a smaller reproducer (which again, works with 1 process) if it helps:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
ERROR: Out of GPU memory trying to allocate 232.742 KiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
luraess commented 3 years ago

Hi @raminammour, thanks for reporting. I just tried your MVE and it works fine for me, both with -n 1 and -n 2. Before we dive further into debugging: are you sure that, when running on more than one MPI process, each process has access to its own dedicated GPU (i.e., that both processes are not trying to initialize on the same GPU)? That could explain the error you get. Also, what type of multi-GPU system are you running on, and how many GPUs per node?

Here is my output:

[lraess@node]$ $mpirun_ -n 1 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 31x31x31 (nprocs: 1, dims: 1x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0

[lraess@node]$ $mpirun_ -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
0.0
raminammour commented 3 years ago

Hey, thanks for the reply.

These are nodes with 4 GPUs. I reserve them with Slurm, and I just tried to be more explicit: srun -n 4 --gres=gpu:4 --cpus-per-gpu=1 --pty bash -l. Is there a way to make sure that each MPI process has a dedicated GPU? (It seems that the Slurm installation we have recognizes neither the ntasks-per-gpu nor the bind-gpu options.)
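When Slurm itself cannot bind GPUs per task, a common workaround is a small wrapper script that pins each rank to one device via CUDA_VISIBLE_DEVICES before Julia starts. This is only a sketch: the script name is hypothetical, it assumes exactly one rank per GPU, and note that with this masking each rank sees a single device, which interacts with select_device()'s own local-rank mapping (use one mechanism or the other, not both).

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical name): pin each MPI rank to one GPU by mapping
# the node-local rank that Slurm exposes (SLURM_LOCALID) to a device index.
# Assumes exactly one rank per GPU on each node.
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID:-0}
echo "local rank ${SLURM_LOCALID:-0} -> CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
# Launch the actual application with the masked device visibility.
exec "$@"
```

Usage would then look like srun -n 4 --gres=gpu:4 ./bind_gpu.sh julia --project=@. app.jl.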

luraess commented 3 years ago

You're welcome. Maybe you could try running -n 2 on 2 different nodes with 1 GPU per node, to see whether that avoids the error; that would confirm the issue is multiple MPI processes grabbing the same GPU on multi-GPU nodes.

Another thing to try could be --ntasks-per-node=4 instead of --cpus-per-gpu=1 (but I don't have much experience with SLURM).

Is there a way to make sure that each mpi process has a dedicated gpu?

On the application side, it should be handled by ImplicitGlobalGrid's select_device(), which gets the node-local MPI rank and assigns it as the GPU ID. You can check it by printing the returned value, as in @show me = select_device():

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);@show me=select_device();@show sum(@zeros(siz,siz,siz))"
raminammour commented 3 years ago

It seems to be correctly selecting different devices:

--------------------------------------------------------------------------
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
me = select_device() = 1
me = select_device() = 0

I simplified further and am getting segfaults from the CUDA and MPI interaction. I will investigate further and come back here if the error is not coming from upstream.

For now feel free to close the issue if you see fit, or keep it open and I will notify you of any resolution.

Cheers!

luraess commented 3 years ago

Indeed, device selection seems to be OK. Could you try running on different nodes?

I simplified further and getting segfaults with CUDA and MPI interaction

That's not nice. Do you use CUDA-aware MPI?

You certainly did already, but these two resources may give further hints on how to set up the environment to run multi-GPU:

raminammour commented 3 years ago

Yeah, MPI reports that it has_cuda(), but it still segfaults :(

luraess commented 3 years ago

Could you maybe try export IGG_CUDAAWARE_MPI=0 to disable CUDA-awareness within ImplicitGlobalGrid.jl, and see whether that fixes the segfault?

Is the machine you are running on using MPICH as its MPI implementation?

raminammour commented 3 years ago

This machine is using OpenMPI. export IGG_CUDAAWARE_MPI=0 does not fix the OOM error above.

luraess commented 3 years ago

Does the MPI install use UCX instead of openib? I have had issues getting MPI to work properly when OpenMPI was configured with UCX.
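For reference, an Open MPI installation's build configuration can be inspected without admin rights using ompi_info, which ships with Open MPI. A sketch (the exact output fields vary by version, and the grep patterns here are only illustrative):

```shell
# Query how the system's Open MPI was built. On machines without Open MPI
# this simply reports that ompi_info is missing.
if command -v ompi_info >/dev/null 2>&1; then
  # Was Open MPI compiled with CUDA-aware support?
  ompi_info --parsable --all | grep -i 'mpi_built_with_cuda_support:value' || true
  # Which transport components are present (UCX vs openib)?
  ompi_info | grep -iE 'ucx|openib' || true
else
  echo "ompi_info not found"
fi
```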

raminammour commented 3 years ago

I am on a managed system, so unfortunately I cannot change any of the installations.

luraess commented 3 years ago

If you have multiple nodes, it would be interesting to see whether the error occurs when running with one GPU per node on 2 nodes.

Also, do you get any errors (segfaults) running on 2 MPI processes with the Threads backend?

raminammour commented 3 years ago

It works with one GPU per node with two MPI processes, but not with 2 GPUs on one node!

luraess commented 3 years ago

Interesting.

omlins commented 3 years ago

I simplified further and getting segfaults with CUDA and MPI interaction. I will investigate further, and come back here if the error is not coming from upstream.

Hi @raminammour, do I understand correctly that you could reproduce the issue without ImplicitGlobalGrid and ParallelStencil? Then your best bet would be to post in the GPU section of the Julia Discourse to get help. My guess is that something is wrong with how GPU memory is allocated to your processes, independent of what code you run...

raminammour commented 3 years ago

Thank you both for helping!

Yes, @omlins, I was able to get a segfault just using MPI and CUDA with a broadcast/receive pattern. The OOM I am getting above sometimes leads to a segfault, which leads me to believe there is undefined behavior somewhere, as you say, in the allocation of memory on the GPUs.

We have a combinatorial mix of CUDA drivers, compilers, MPI builds, and other little things that make it difficult for me to build a reproducible example. The system is also new, and in my experience these things get ironed out in the first few weeks (as the admins correct them). When I manage to build a reproducer, I will file upstream or discuss it on Discourse, as you advise.

Feel free to close this for now; we can re-open if I determine that the issue pertains to ParallelStencil or ImplicitGlobalGrid, which are great regardless :)

omlins commented 3 years ago

OK, I will close the issue then. Don't hesitate to open a topic on the Julia Discourse; it could well be that somebody will immediately know what's going on...

PS: thanks for your nice words about ParallelStencil and ImplicitGlobalGrid!