gabrielsantosrv opened this issue 3 years ago
I'm hitting the same error. Have you solved it?
It shows " dist.is_initialized() is False" and "os.environ['WORLD_SIZE'] ** keyError WORLD_SIZE".
@PepZhu I haven't been able to solve it yet...
I googled it and found this similar issue, where they suggest launching with the PyTorch distributed launch utility, i.e.

```
python -m torch.distributed.launch [torch.distributed.launch params] your_script.py [your_script params]
```
I work on 8 Tesla K80 GPUs, and this script works fine for me:

```
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_captioning.py [your_script params]
```
Hope this helps!
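For anyone wondering why the launcher helps: `torch.distributed.launch` spawns one process per GPU and sets the environment variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) that the training script reads, passing the local rank as a `--local_rank` argument (newer PyTorch versions also set `LOCAL_RANK`). A rough sketch of the setup that then runs inside each spawned process (simplified; the actual code in `run_captioning.py` may differ):

```python
import argparse
import os
import torch
import torch.distributed as dist

# torch.distributed.launch passes --local_rank to each spawned process
# and sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in its env.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

world_size = int(os.environ.get('WORLD_SIZE', 1))
if world_size > 1:
    torch.cuda.set_device(args.local_rank)
    # NCCL is the standard backend for multi-GPU training on one node.
    dist.init_process_group(backend='nccl')
    assert dist.is_initialized()
```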
The following command works for me on 4 GPUs:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 run_captioning.py \
    --model_name_or_path ./checkpoints/checkpoint-29-66420 \
    --do_train \
    --do_lower_case \
    --evaluate_during_training \
    --add_od_labels \
    --learning_rate 0.00003 \
    --per_gpu_train_batch_size 64 \
    --num_train_epochs 30 \
    --save_steps 40 \
    --output_dir ./checkpoints/new_checkpoints \
    --train_yaml val.yaml \
    --data_dir ../sample_test/nocaps \
    --val_yaml val_coco.yaml
```
I am trying to train the captioning base model on two Quadro RTX 8000 GPUs, each with 48 GiB of memory. But when I run the command to train the model:
```
python oscar/run_captioning.py \
    --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997 \
    --do_train \
    --do_lower_case \
    --evaluate_during_training \
    --add_od_labels \
    --learning_rate 0.00003 \
    --per_gpu_train_batch_size 64 \
    --num_train_epochs 30 \
    --save_steps 5000 \
    --output_dir output/
```
it shows a warning. Also, it returns an error and the training stops.
PS: I am using both GPUs on the same machine, so I am not training the model in a distributed way
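In case it is useful to anyone avoiding the launcher: a single process can still use both GPUs via `torch.nn.DataParallel`, which replicates the model and splits each batch across the visible devices, with no `WORLD_SIZE` or process-group setup. Whether `run_captioning.py` actually takes this code path I can't say; this is just a generic sketch with a placeholder model:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the actual captioning model.
model = nn.Linear(512, 512).cuda()

# DataParallel replicates the module on every visible GPU and splits
# the input batch along dim 0; no distributed init is required.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

output = model(torch.randn(64, 512).cuda())  # batch split across GPUs
print(output.shape)
```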