Open HuangLianghong opened 1 year ago
That usage is correct but you also need to instruct each process about which GPU to use (they default to the first one, so both processes are using the same device in your case).
You can use something like
import os
from mpi4py import MPI
os.environ["CUDA_VISIBLE_DEVICES"] = str(MPI.COMM_WORLD.Get_rank())
near the top of your setup file (before importing veros.core or JAX).
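For runs that span more than one node, the global rank can exceed the number of GPUs on a node. A minimal sketch of one way to handle that, assuming ranks are placed consecutively on each node and every node has the same number of GPUs (select_gpu, pin_gpu and gpus_per_node are hypothetical names for illustration, not part of Veros or mpi4py):

```python
import os


def select_gpu(rank: int, gpus_per_node: int) -> str:
    """Return the CUDA device index (as a string) for a given MPI rank.

    Assumption: ranks are laid out consecutively per node, so the rank
    modulo the GPU count gives a distinct device for each local process.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return str(rank % gpus_per_node)


def pin_gpu(rank: int, gpus_per_node: int) -> None:
    # This has to run before JAX (or veros.core) is imported, because the
    # visible devices are fixed once the CUDA runtime is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = select_gpu(rank, gpus_per_node)


# In a setup file this might be used as follows (requires mpi4py):
#   from mpi4py import MPI
#   pin_gpu(MPI.COMM_WORLD.Get_rank(), gpus_per_node=4)
```

On a single node with one process per GPU this reduces to the snippet above, since the rank is then already a valid device index.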
I have been trying to run VEROS with multi-GPU (4 GPUs) with the modification suggested above. But it randomly stops at an iteration and says 'solution diverged'. Also, the GPU memory usage is very low. Could you please suggest any solutions?
Below is the attached log file. log.txt
But it randomly stops at an iteration and says 'solution diverged'.
Looks like your solution diverged. Multi-GPU runs use a different linear solver than other configurations by default so you might see divergence for runs that are stable in other settings. Please revise your time steps and / or solver settings.
You can also use different PETSc settings like this: https://github.com/dionhaefner/veros-01deg/blob/4f096b11206fecfac003047a234fcf25f92291a0/global_01deg/global_01deg.py#L14-L16
Note that multi-GPU runs for very high-res setups are not well explored so some manual tweaking is to be expected.
Unless the simulation doesn't fit on a single A100 I would stay away from multi-GPU runs (it probably won't even be faster than single-GPU in this case).
Also, the GPU memory usage is very low.
It is not, I see 4 GPUs using ~60GB of memory each, as expected.
Sorry, my mistake: memory usage is high, but GPU utilization is very low; the highest is 20%.
Yes, that is another indicator that your GPUs aren't fully utilized, so you should probably not run on multi-GPU in the first place.
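Since memory usage and compute utilization are easy to mix up, it can help to read both per GPU while the model is running. A rough sketch that wraps nvidia-smi's CSV query output; the function names and the sample values in the usage note are made up for illustration, and query_gpu_stats assumes nvidia-smi is on the PATH:

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "index,utilization.gpu,memory.used"


def parse_gpu_stats(csv_text):
    """Parse 'index, util, memory' CSV rows into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {
                "index": int(index),
                "util_percent": int(util),
                "memory_mib": int(mem),
            }
        )
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling parse_gpu_stats("0, 20, 61000") on a made-up row gives one entry with util_percent 20 and memory_mib 61000: high memory use alongside low utilization, which is the pattern described above.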
Hi! I am trying to run Veros with multi-GPU. It works when I run acc_benchmark.py, but when I try to run global_flexible.py with the command
mpirun -np 2 veros run global_flexible/global_flexible.py -n 1 2 --force-overwrite -b jax --device gpu
it seems that only one GPU is working. Could you please tell me what I should do? Thanks in advance!