I am trying to pretrain BLIP-2 on a SLURM cluster, but it seems that the current program does not support distributed training on SLURM by default. Any advice?
```
| distributed init (rank 0, world 1): env://
Traceback (most recent call last):
  File "/public/home/v-zhaozh/LAVIS/train.py", line 105, in <module>
    main()
  File "/public/home/v-zhaozh/LAVIS/train.py", line 85, in main
    init_distributed_mode(cfg.run_cfg)
  File "/public/home/v-zhaozh/LAVIS/lavis/common/dist_utils.py", line 80, in init_distributed_mode
    torch.distributed.init_process_group(
  File "/public/home/v-zhaozh/anaconda3/envs/lavis/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/public/home/v-zhaozh/anaconda3/envs/lavis/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 243, in _env_rendezvous_handler
    master_addr = _get_env_or_raise("MASTER_ADDR")
  File "/public/home/v-zhaozh/anaconda3/envs/lavis/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
```
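For reference, the `env://` rendezvous expects `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` in the environment, and SLURM does not set any of these by itself. When the job is launched with `srun`, however, SLURM exports `SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, and `SLURM_JOB_NODELIST`, from which the torch.distributed variables can be derived. Below is a minimal sketch of that mapping; `init_distributed_from_slurm` is a hypothetical helper, not part of LAVIS:

```python
import os
import subprocess

import torch
import torch.distributed as dist


def init_distributed_from_slurm(backend="nccl", port=29500):
    """Hypothetical helper: derive env:// rendezvous variables from SLURM.

    Assumes the job was launched with `srun`, so SLURM_PROCID,
    SLURM_NTASKS, SLURM_LOCALID, and SLURM_JOB_NODELIST are set.
    """
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # The first hostname in the allocation acts as the rendezvous master.
    node_list = os.environ["SLURM_JOB_NODELIST"]
    master_addr = subprocess.check_output(
        ["scontrol", "show", "hostnames", node_list], text=True
    ).split()[0]

    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", str(port))
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["LOCAL_RANK"] = str(local_rank)

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend=backend, init_method="env://")
```

With something like this wired into `init_distributed_mode`, the job would be launched from an sbatch script as `srun python train.py ...`, one task per GPU, so each process picks up its own `SLURM_PROCID`.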