plai-group / flexible-video-diffusion-modeling


multi-GPU support #1

Closed xiaoiker closed 2 years ago

xiaoiker commented 2 years ago

Hi, thanks for releasing this great project. One small question: it seems the code does not support multi-GPU training. Is this true, and if not, how can I set it up? Best!

wsgharvey commented 2 years ago

Thanks for getting in touch! I've just verified that this code can train on two GPUs. I am not sure what your infrastructure is like, but I submitted the following with SLURM to a Compute Canada cluster.

I ran sbatch job.sh where job.sh contains:

#!/bin/bash
#SBATCH --account=rrg-kevinlb
#SBATCH --mem=128000M
#SBATCH --cpus-per-task=12
#SBATCH --nodes=1
#SBATCH --ntasks=2               # number of MPI processes
#SBATCH --gres=gpu:2
#SBATCH --time=0-03:00           # time (DD-HH:MM)
#SBATCH --output="/scratch/wsgh/slurm-outputs/%j.out"
#SBATCH --error="/scratch/wsgh/slurm-outputs/%j.err"

export WANDB_ENTITY="universal-conditional-ddpm"
export WANDB_PROJECT="video-diffusion"

module load mpi4py  # load the mpi4py python package - cannot be pip-installed on computecanada
# The environment below was installed using `module load python/3.8.10` and then (inside a virtualenv) `pip install torch==1.10.0+computecanada torchvision==0.11.1+computecanada tqdm==4.63.1+computecanada wandb==0.12.5+computecanada matplotlib==3.5.1+computecanada imageio==2.16.2+computecanada moviepy==1.0.3+computecanada blobfile==1.2.9`
source /home/wsgh/projects/def-fwood/wsgh/envs/flexible-video-diffusion-modeling/bin/activate  # activate the Python environment
cd /scratch/wsgh/flexible-video-diffusion-modeling
wandb offline  # necessary because computecanada doesn't allow internet access on computenodes

srun -n 2 python scripts/video_train.py --batch_size=2 --max_frames 20 --dataset=carla_no_traffic --num_res_blocks=1

To briefly describe this script: the SBATCH directives at the top are all SLURM-specific configuration for the desired compute allocation. The WANDB_ENTITY and WANDB_PROJECT environment variables are used within the Python code to log to wandb and should be set to an existing account and project. The following few lines activate the required Python packages/environment and move to the correct directory. wandb offline is required to prevent the Python code from attempting to log to wandb.com, which is inaccessible on the compute nodes I have access to; the syncing can be performed later by running wandb sync from the same directory. Then srun starts two processes (one for each task/GPU), each of which runs python scripts/video_train.py, and the two processes communicate with each other.
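For reference, the offline-then-sync workflow looks roughly like this (a sketch; the offline run directories that wandb creates under ./wandb/ will have different timestamps/ids on your setup):

wandb offline                    # log locally instead of to wandb.com
srun -n 2 python scripts/video_train.py ...   # train as usual; runs are written to ./wandb/offline-run-*
wandb sync wandb/offline-run-*   # later, from a machine with internet access, upload the local runs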

Hopefully this is helpful - if not then feel free to send whatever error messages you are getting :)

xiaoiker commented 2 years ago

Hi

Thanks a lot for your response. I run the code on GCP with 4 A100 GPUs. It works well when using python video_train.py --batch_size 2.

But when I try to run it on 3 GPUs with CUDA_VISIBLE_DEVICES=1,2,3 python video_train.py --batch_size 6, there is of course an OOM issue. I then tried manually changing WORLD_SIZE in dist_util.py to 3, but when I run the command again it just gets stuck and nothing happens.

Do you have any suggestions? (srun on GCP seems complicated.)

Best

xiaoiker commented 2 years ago

Oh, it seems mpiexec works for this issue.

wsgharvey commented 2 years ago

I should probably have made it clearer that the --batch_size argument refers to the batch size per GPU rather than the total batch size. So, e.g., my example script was using a total batch size of 4 split over 2 GPUs, and you probably need --batch_size 2 (a total batch size of 6 over 3 GPUs) to avoid the OOM error.
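For example, since mpiexec is working for you, something like the following should give a total batch size of 6 over 3 GPUs (a sketch reusing the flags from my script above; adjust the dataset/model flags for your setup):

mpiexec -n 3 python scripts/video_train.py --batch_size=2 --max_frames 20 --dataset=carla_no_traffic --num_res_blocks=1
# --batch_size is per process/GPU, so the effective batch size here is 3 x 2 = 6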

And glad to hear that mpiexec fixed it; I was about to suggest that :D