team-ocean / veros

The versatile ocean simulator, in pure Python, powered by JAX.
https://veros.readthedocs.io
MIT License

How to run veros with multi-GPU #417

Open HuangLianghong opened 1 year ago

HuangLianghong commented 1 year ago

Hi! I am trying to run veros on multiple GPUs. It works when I run acc_benchmark.py, but when I run global_flexible.py with the command mpirun -np 2 veros run global_flexible/global_flexible.py -n 1 2 --force-overwrite -b jax --device gpu, it seems that only one GPU is working.


Could you please tell me what I should do? Thanks in advance!

dionhaefner commented 1 year ago

That usage is correct, but you also need to tell each process which GPU to use (they default to the first one, so in your case both processes are using the same device).

You can use something like

import os
from mpi4py import MPI

os.environ["CUDA_VISIBLE_DEVICES"] = str(MPI.COMM_WORLD.Get_rank())

near the top of your setup file (before importing veros.core or JAX).
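The snippet above uses the global MPI rank, which matches the GPU index only when all processes run on one node. As a minimal sketch of a variant for several nodes, the helper below reads the node-local rank from environment variables that common launchers set (Open MPI, MVAPICH2, Slurm) and wraps it around the number of GPUs per node. The function names `local_rank` and `assign_gpu` and the `gpus_per_node` parameter are hypothetical and not part of veros; whether your launcher sets one of these variables is an assumption you should verify.

```python
import os

# Variables that common launchers set to the rank of a process within its node.
# Assumption: your launcher sets one of these; otherwise we fall back to 0.
LOCAL_RANK_VARS = (
    "OMPI_COMM_WORLD_LOCAL_RANK",  # Open MPI
    "MV2_COMM_WORLD_LOCAL_RANK",   # MVAPICH2
    "SLURM_LOCALID",               # Slurm (srun)
)


def local_rank(environ=os.environ):
    """Return the node-local rank, or 0 if no known variable is set."""
    for name in LOCAL_RANK_VARS:
        if name in environ:
            return int(environ[name])
    return 0


def assign_gpu(gpus_per_node, environ=os.environ):
    """Hypothetical helper: expose exactly one GPU to the current process.

    Must run before JAX (or veros.core) is imported, because the CUDA
    runtime reads CUDA_VISIBLE_DEVICES when it is first initialized.
    """
    device = local_rank(environ) % gpus_per_node
    environ["CUDA_VISIBLE_DEVICES"] = str(device)
    return device
```

Calling `assign_gpu(4)` near the top of the setup file would then play the same role as the `os.environ[...]` line above on a machine with four GPUs per node.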

Sougata18 commented 6 months ago

Screenshot_veros

I have been trying to run Veros on multiple GPUs (4 GPUs) with the modification suggested above. But it randomly stops at an iteration and says 'solution diverged'. Also, the GPU memory usage is very low. Could you please suggest a solution?

The log file is attached: log.txt

dionhaefner commented 6 months ago

> But it randomly stops at an iteration and says 'solution diverged'.

Looks like your solution diverged. By default, multi-GPU runs use a different linear solver than other configurations, so you might see divergence for runs that are stable in other settings. Please revise your time steps and/or solver settings.

You can also use different PETSc settings like this: https://github.com/dionhaefner/veros-01deg/blob/4f096b11206fecfac003047a234fcf25f92291a0/global_01deg/global_01deg.py#L14-L16

Note that multi-GPU runs for very high-res setups are not well explored so some manual tweaking is to be expected.

Unless the simulation doesn't fit on a single A100, I would stay away from multi-GPU runs (it probably won't even be faster than single-GPU in this case).

> Also, the GPU memory usage is very low.

It is not, I see 4 GPUs using ~60GB of memory each, as expected.

Sougata18 commented 6 months ago

Sorry, my mistake: memory usage is high, but GPU utilization is very low. The highest is 20%.

dionhaefner commented 6 months ago

Yes, that is another indicator that your GPUs aren't fully utilized, so you should probably not run on multiple GPUs in the first place.
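
The difference between memory usage and utilization discussed above can be checked programmatically. High memory usage alone says little, since JAX preallocates a large share of GPU memory by default. As a rough sketch, the code below parses the CSV output of `nvidia-smi` and lists GPUs whose compute utilization falls below a threshold. The function names and the 50% threshold are illustrative assumptions, not a veros feature or a recommendation from the maintainers.

```python
import subprocess

# Real nvidia-smi query flags; output is one "index, util, memory" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV output into a list of per-GPU dictionaries."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {"index": int(index), "util_percent": int(util), "memory_mib": int(mem)}
        )
    return stats


def underutilized(stats, threshold_percent=50):
    """Return indices of GPUs below the (arbitrary, illustrative) threshold."""
    return [s["index"] for s in stats if s["util_percent"] < threshold_percent]


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

On a machine with NVIDIA GPUs, `underutilized(query_gpus())` sampled a few times during a run gives a quick impression of whether the extra devices are doing useful work.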