gabrielsantosrv opened this issue 3 years ago
I'm hitting the same error. Have you solved it?
It shows " dist.is_initialized() is False" and "os.environ['WORLD_SIZE'] ** keyError WORLD_SIZE".
@PepZhu I haven't been able to solve it yet...
I googled it and found this similar issue, where they suggest launching with the PyTorch distributed launch utility, i.e.

```
python -m torch.distributed.launch [torch.distributed.launch params] your_script.py [your_script params]
```
I work on 8 Tesla K80 GPUs, and this script works fine for me:

```
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_captioning.py [your_script params]
```
Hope this helps!
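For anyone wondering why the launcher helps: `torch.distributed.launch` spawns one process per GPU and sets the environment variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) that the training script reads, passing the local rank as a `--local_rank` argument (newer PyTorch versions also set `LOCAL_RANK`). A rough sketch of the setup that then runs inside each spawned process (simplified; the actual code in `run_captioning.py` may differ):

```python
import argparse
import os
import torch
import torch.distributed as dist

# torch.distributed.launch passes --local_rank to each spawned process
# and sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in its env.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

world_size = int(os.environ.get('WORLD_SIZE', 1))
if world_size > 1:
    torch.cuda.set_device(args.local_rank)
    # NCCL is the standard backend for multi-GPU training on one node.
    dist.init_process_group(backend='nccl')
    assert dist.is_initialized()
```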
The following command works for me on 4 GPUs:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 run_captioning.py \
    --model_name_or_path ./checkpoints/checkpoint-29-66420 \
    --do_train \
    --do_lower_case \
    --evaluate_during_training \
    --add_od_labels \
    --learning_rate 0.00003 \
    --per_gpu_train_batch_size 64 \
    --num_train_epochs 30 \
    --save_steps 40 \
    --output_dir ./checkpoints/new_checkpoints \
    --train_yaml val.yaml \
    --data_dir ../sample_test/nocaps \
    --val_yaml val_coco.yaml
```
I am trying to train the captioning base model on two Quadro RTX 8000 GPUs, each with 48 GiB of memory. But when I run the command to train the model:
```
python oscar/run_captioning.py \
    --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997 \
    --do_train \
    --do_lower_case \
    --evaluate_during_training \
    --add_od_labels \
    --learning_rate 0.00003 \
    --per_gpu_train_batch_size 64 \
    --num_train_epochs 30 \
    --save_steps 5000 \
    --output_dir output/
```
it shows a warning. Also, it returns an error and the training stops.
PS: I am using both GPUs on the same machine, so I am not training the model in a distributed way
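In case it is useful to anyone avoiding the launcher: a single process can still use both GPUs via `torch.nn.DataParallel`, which replicates the model and splits each batch across the visible devices, with no `WORLD_SIZE` or process-group setup. Whether `run_captioning.py` actually takes this code path I can't say; this is just a generic sketch with a placeholder model:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the actual captioning model.
model = nn.Linear(512, 512).cuda()

# DataParallel replicates the module on every visible GPU and splits
# the input batch along dim 0; no distributed init is required.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

output = model(torch.randn(64, 512).cuda())  # batch split across GPUs
print(output.shape)
```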