salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

how to implement it on a slurm cluster #396

Open zhaozh10 opened 1 year ago

zhaozh10 commented 1 year ago

I am trying to pretrain BLIP-2 on a Slurm cluster, but the current code does not seem to support distributed training under Slurm by default. Any advice?

| distributed init (rank 0, world 1): env://
Traceback (most recent call last):
  File "/public/home/v-zhaozh/LAVIS/train.py", line 105, in <module>
    main()
  File "/public/home/v-zhaozh/LAVIS/train.py", line 85, in main
    init_distributed_mode(cfg.run_cfg)
  File "/public/home/v-zhaozh/LAVIS/lavis/common/dist_utils.py", line 80, in init_distributed_mode
    torch.distributed.init_process_group(
  File "/public/home/v-zhaozh/anaconda3/envs/lavis/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/public/home/v-zhaozh/anaconda3/envs/lavis/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 243, in _env_rendezvous_handler
    master_addr = _get_env_or_raise("MASTER_ADDR")
  File "/public/home/v-zhaozh/anaconda3/envs/lavis/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

ChantalMP commented 1 year ago

Hi :) Did you find a solution for this?

zhaozh10 commented 1 year ago

> Hi :) Did you find a solution for this?

I borrowed some DDP-related code from MMEngine, and it works well on my Slurm cluster. The revised version of dist_utils.py can be found here.
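
For readers hitting the same `MASTER_ADDR expected, but not set` error: the `env://` rendezvous reads `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE` from the environment, and `srun` does not set them. Below is a minimal sketch of the general idea of deriving them from Slurm's own variables, similar in spirit to what MMEngine does. It is not the revised dist_utils.py mentioned above. The function names `first_slurm_host` and `slurm_to_torch_env` are hypothetical, and the sketch assumes one task per GPU and that `scontrol` is on the PATH of the compute nodes.

```python
import subprocess


def first_slurm_host(node_list):
    """Expand a Slurm node list (e.g. "node[01-02]") and return the first host.

    Assumes the `scontrol` CLI is available on the node running this code.
    """
    out = subprocess.check_output(
        ["scontrol", "show", "hostnames", node_list], text=True
    )
    return out.splitlines()[0]


def slurm_to_torch_env(env, resolve_host=first_slurm_host, default_port="29500"):
    """Map Slurm per-task variables onto the ones env:// rendezvous reads.

    `env` is any mapping shaped like os.environ. An existing MASTER_ADDR or
    MASTER_PORT is kept; otherwise the first node of the job and
    `default_port` (an assumed free port) are used.
    """
    return {
        "MASTER_ADDR": env.get("MASTER_ADDR") or resolve_host(env["SLURM_NODELIST"]),
        "MASTER_PORT": env.get("MASTER_PORT", default_port),
        "WORLD_SIZE": str(int(env["SLURM_NTASKS"])),   # total number of tasks
        "RANK": str(int(env["SLURM_PROCID"])),         # global rank of this task
        "LOCAL_RANK": str(int(env.get("SLURM_LOCALID", 0))),  # rank within the node
    }
```

One way to use it would be to call `os.environ.update(slurm_to_torch_env(os.environ))` at the top of the training script, before `init_distributed_mode` runs, and to launch with `srun` so that every task sees its own `SLURM_PROCID`. The fixed default port can collide if two jobs share a node, so exporting a distinct `MASTER_PORT` in the job script may be safer.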