odissei-lifecourse / life-sequencing-dutch

MIT License

Multiple GPUs on OSSC #8

Open tanzir5 opened 4 months ago

tanzir5 commented 4 months ago

Summarizing the situation here (Apr 16):

  1. We can run pytorch_lightning on a single GPU (without srun) as long as the trainer's strategy is "auto" (the default).
  2. It fails when the strategy is "ddp", even on a single GPU (and therefore also for multi-GPU).
  3. For the future: export CUDA_VISIBLE_DEVICES=0,1,2,3 may be helpful, but we have not yet confirmed whether it is necessary.
  4. The srun issue is partially resolved for now with this fix: srun --mpi=pmi2 python3 hello.py. The --mpi=pmi2 flag is the key.
  5. Ben is looking into fixing the multi-GPU issue.
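For checking points 3 and 4, it can help to have every srun task report what Slurm and CUDA actually expose to it. The following is only a minimal sketch using the standard library; the helper names `describe_task` and `visible_gpu_ids` are hypothetical and this is not the contents of hello.py or check_gpu.py. The environment variable names themselves are the standard Slurm and CUDA ones.

```python
import os

# Standard Slurm / CUDA environment variables that are useful to print per task.
KEYS = (
    "SLURM_JOB_ID",
    "SLURM_NTASKS",
    "SLURM_PROCID",
    "SLURM_LOCALID",
    "CUDA_VISIBLE_DEVICES",
)


def describe_task(env=None):
    """Return the variables in KEYS, marking missing ones as '<unset>'.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return {key: env.get(key, "<unset>") for key in KEYS}


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device ids.

    Returns None when the variable is unset (CUDA then exposes all devices)
    and an empty list when it is set but empty (no devices visible).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    for key, value in describe_task().items():
        print(f"{key}={value}")
    print("visible GPUs:", visible_gpu_ids())
```

Launched through srun with several tasks, each task should print a different SLURM_PROCID; if they all print "<unset>", the script is not running under srun.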

TODO FRIDAY (APR 19):

  1. Try running the batch script for check_gpu.py for a single gpu with srun to verify srun issue is completely fixed.

f-hafner commented 4 months ago

It seems multi-gpu will work with the MPI backend, but it's unclear how much the CPU-GPU transfer slows down training.

tanzir5 commented 3 months ago

Interestingly, the above command also works for multi-GPU training with the dummy model, but it fails for the language model regardless of the number of GPUs. I should verify on Snellius that the language model can train using DDP over MPI, and maybe also on the Stony Brook server once we have fake data.

f-hafner commented 3 months ago

see also #23

f-hafner commented 1 month ago

An intermediate conclusion from #23 is to use srun, but not the MPI backend. The current code is about as fast with the nccl backend as with the gloo backend. We have not tried either on the OSSC, though. Depending on the outcome of #74, we need to make nccl work on the OSSC as well.
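To make switching between the two backends explicit when testing on the OSSC, the choice could be isolated in one small function. This is a rough sketch, not code from this repository; `choose_backend` is a hypothetical helper, and it only encodes the general fact that nccl needs CUDA devices while gloo also runs on CPU.

```python
from typing import Optional

# Backends considered here; MPI is left out, following the conclusion from #23.
VALID_BACKENDS = ("nccl", "gloo")


def choose_backend(cuda_available: bool, requested: Optional[str] = None) -> str:
    """Pick a torch.distributed backend name (hypothetical helper).

    If `requested` is given it is validated and returned; otherwise nccl is
    used when CUDA is available and gloo as the CPU fallback.
    """
    if requested is not None:
        if requested not in VALID_BACKENDS:
            raise ValueError(f"unsupported backend: {requested!r}")
        if requested == "nccl" and not cuda_available:
            raise ValueError("nccl requires CUDA devices")
        return requested
    return "nccl" if cuda_available else "gloo"


if __name__ == "__main__":
    # In a real script, cuda_available would come from torch.cuda.is_available().
    # The result could then be passed on, e.g. as
    # torch.distributed.init_process_group(backend=...) or, in Lightning,
    # DDPStrategy(process_group_backend=...).
    print(choose_backend(cuda_available=False))
```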

f-hafner commented 2 weeks ago

By now I know that NCCL works on the OSSC with PyTorch (with FSDP, on a much larger model than ours). We need to figure out whether our issue persists and whether it is down to Lightning or to our own code.

benczaja commented 2 weeks ago

Hey, I tried some PyTorch DDP (using NCCL). I think we need to set NCCL_SOCKET_IFNAME=lo in the jobscript.

Actually, this is probably better:

export NCCL_SOCKET_IFNAME=ib0

You are right that NCCL does work; it just seems that we need to be explicit on the OSSC.
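If we would rather not hard-code the interface in the jobscript, the same setting could be applied from Python before NCCL is initialized. The sketch below is only an illustration under assumptions: `pick_nccl_interface` and `configure_nccl` are hypothetical helpers, and the preference order (ib0 first, then lo) simply mirrors the two values mentioned in this thread. It has not been tested on the OSSC.

```python
import os
import socket


def pick_nccl_interface(preferred=("ib0", "lo"), available=None):
    """Return the first preferred interface that exists on this node, else None.

    `available` defaults to the interface names reported by the OS.
    """
    if available is None:
        available = [name for _, name in socket.if_nameindex()]
    for name in preferred:
        if name in available:
            return name
    return None


def configure_nccl(env=None, preferred=("ib0", "lo"), available=None):
    """Set NCCL_SOCKET_IFNAME unless the jobscript already exported it.

    Must run before the NCCL process group is created, since NCCL reads
    the variable at initialization. Returns the value in effect, or None.
    """
    env = os.environ if env is None else env
    if "NCCL_SOCKET_IFNAME" in env:
        return env["NCCL_SOCKET_IFNAME"]  # respect the explicit export
    name = pick_nccl_interface(preferred, available)
    if name is not None:
        env["NCCL_SOCKET_IFNAME"] = name
    return name


if __name__ == "__main__":
    print("NCCL_SOCKET_IFNAME =", configure_nccl())
```

An explicit export in the jobscript still takes precedence, so this only acts as a fallback.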