Thanks for getting in touch! I've just verified that this code can train on two GPUs. I am not sure what your infrastructure is like, but I submitted the following with SLURM to a Compute Canada cluster.
I ran sbatch job.sh, where job.sh contains:
#!/bin/bash
#SBATCH --account=rrg-kevinlb
#SBATCH --mem=128000M
#SBATCH --cpus-per-task=12
#SBATCH --nodes=1
#SBATCH --ntasks=2 # number of MPI processes
#SBATCH --gres=gpu:2
#SBATCH --time=0-03:00 # time (DD-HH:MM)
#SBATCH --output="/scratch/wsgh/slurm-outputs/%j.out"
#SBATCH --error="/scratch/wsgh/slurm-outputs/%j.err"
export WANDB_ENTITY="universal-conditional-ddpm"
export WANDB_PROJECT="video-diffusion"
module load mpi4py # load the mpi4py python package - cannot be pip-installed on computecanada
source /home/wsgh/projects/def-fwood/wsgh/envs/flexible-video-diffusion-modeling/bin/activate # installed using `module load python/3.8.10` and then (inside a virtualenv) `pip install torch==1.10.0+computecanada torchvision==0.11.1+computecanada tqdm==4.63.1+computecanada wandb==0.12.5+computecanada matplotlib==3.5.1+computecanada imageio==2.16.2+computecanada moviepy==1.0.3+computecanada blobfile==1.2.9`
cd /scratch/wsgh/flexible-video-diffusion-modeling
wandb offline # necessary because computecanada doesn't allow internet access on computenodes
srun -n 2 python scripts/video_train.py --batch_size=2 --max_frames 20 --dataset=carla_no_traffic --num_res_blocks=1
To briefly describe this script: the SBATCH directives at the top are SLURM-specific settings for the requested compute allocation. WANDB_ENTITY and WANDB_PROJECT are used within the Python code to log to wandb and should be set to an existing account and project. The next few lines activate the required Python environment and move into the correct directory. wandb offline is required to stop the Python code from attempting to log to wandb.com, which is inaccessible on the compute nodes I have access to; the syncing can be performed later by running wandb sync from the same directory. Then srun starts two processes (one per task/GPU), each of which runs python scripts/video_train.py, and the processes communicate with each other.
Hopefully this is helpful - if not then feel free to send whatever error messages you are getting :)
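To make the multi-process setup more concrete: each process that srun (or mpiexec) launches has to discover its MPI rank and join a shared torch.distributed process group. Below is a minimal sketch of that handshake, assuming a guided-diffusion-style dist_util; the helper name setup_dist, the GPUS_PER_NODE constant, and the port number are illustrative assumptions, not necessarily what this repo's dist_util.py actually does.

```python
import os
import socket

import torch
import torch.distributed as dist
from mpi4py import MPI

GPUS_PER_NODE = 2  # assumption: matches --gres=gpu:2 in the job script above


def setup_dist():
    """Give each MPI rank its own GPU and join all ranks into one process group."""
    if dist.is_initialized():
        return
    comm = MPI.COMM_WORLD

    # Pin this rank to a single GPU so its model and optimizer tensors land there.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(comm.rank % GPUS_PER_NODE)

    # Rank 0 chooses a rendezvous address and broadcasts it to the other ranks.
    hostname = socket.gethostbyname(socket.getfqdn()) if comm.rank == 0 else None
    os.environ["MASTER_ADDR"] = comm.bcast(hostname, root=0)
    os.environ["MASTER_PORT"] = "29500"  # arbitrary free port, chosen for illustration
    os.environ["RANK"] = str(comm.rank)
    os.environ["WORLD_SIZE"] = str(comm.size)

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
```

This is also why launching a single python process and hand-editing WORLD_SIZE tends to hang: init_process_group waits for WORLD_SIZE ranks to join, and the other ranks never exist unless a launcher such as srun or mpiexec starts them.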
Hi,
Thanks a lot for your response. I run my code on GCP with 4 A100 GPUs. The code works well when using python video_train.py --batch_size 2. But when I try to run it on 3 GPUs with CUDA_VISIBLE_DEVICES=1,2,3 python video_train.py --batch_size 6, I get an OOM error, as expected. I then tried manually changing WORLD_SIZE in dist_util.py to 3, but when I run the command again it just gets stuck and nothing happens.
Do you have any suggestions? (srun on GCP seems complicated.)
Best
Oh, it seems mpiexec works for this issue.
I should probably have made it clearer that the --batch_size argument refers to the batch size per GPU rather than the total batch size. So, for example, my script above was using a total batch size of 4 split over 2 GPUs, and you probably need --batch_size 2 (a total batch size of 6 over your 3 GPUs) to avoid the OOM error.
And glad to hear that mpiexec fixed it - I was about to suggest that :D
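Putting those two points together, a single-node launch along the following lines should work with 3 GPUs. This is only a sketch adapted from the SLURM example above; every flag other than --batch_size is carried over from that script for illustration rather than taken from this exchange.

```bash
# Hypothetical 3-GPU launch without SLURM: mpiexec starts 3 MPI processes,
# one per GPU, and --batch_size=2 per GPU gives a total batch size of 6.
mpiexec -n 3 python scripts/video_train.py --batch_size=2 --max_frames 20 \
    --dataset=carla_no_traffic --num_res_blocks=1
```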
Hi, thanks for releasing this great project. One small question: it seems the code does not support multi-GPU training. Is this true? If not, how can I set it up? Best!