mlfoundations / open_clip

An open source implementation of CLIP.

Issue with Single GPU Utilization in Distributed Training using torchrun #785

Closed Lylinnnnn closed 2 months ago

Lylinnnnn commented 6 months ago

Hi!

I've encountered an issue while attempting to train a model using the torchrun script provided in the README. The script I used is as follows:

export CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node 2 -m training.main \
    --train-data "/ego4d_dealt_data/medium_train_output_0104.csv" \
    --val-data "/ego4d_dealt_data/medium_test_output_0104.csv" \
    --dataset-type "csv" \
    --batch-size 128 \
    --workers 32 \
    --lr 3.2e-6 \
    --wd 0.001 \
    --warmup 2000 \
    --epochs 1 \
    --model ViT-B-32 \
    --force-image-size 224 \
    --report-to "tensorboard" \
    --log-every-n-steps 10 \
    --lr-scheduler const

Although the world_size is set to 2 and the command line output indicates that distributed mode is running on two processes (on cuda:0 and cuda:1 respectively), I've observed through nvidia-smi that only one GPU is actively being utilized during training. The usage of the second GPU consistently remains at 0%, with occasional spikes to around 90% for just a few seconds.
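As a quick sanity check (a minimal sketch, not open_clip code): torchrun sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables in each worker process, so printing them at startup confirms that both processes launched and were assigned distinct devices:

```python
import os

def describe_worker(env=os.environ):
    """Summarize the distributed context torchrun provides via env vars.

    torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for each worker;
    LOCAL_RANK is the index typically used to pick the CUDA device.
    """
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return f"rank {rank}/{world_size} -> cuda:{local_rank}"

print(describe_worker())
```

If both workers report distinct ranks and devices, the launch itself is fine and the imbalance is more likely downstream (e.g. data loading), rather than a mis-configured distributed setup.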

Could you please assist in identifying the potential causes of this issue, and suggest any necessary adjustments to ensure both GPUs are effectively utilized in distributed training?

rwightman commented 6 months ago

Quite possibly a dataloading / efficiency problem. I wouldn't recommend CSV-based datasets. Can you compare single-GPU vs 2-GPU stats? And ignore GPU utilization %: what's the GPU power consumption? System CPU %?
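One way to test the dataloading hypothesis (a hedged sketch, assuming a hypothetical `profile_loader` helper; none of these names are from open_clip): time how long each step blocks waiting on the loader versus how long the training step itself takes. If data-wait dominates, the GPU sits idle between batches, which matches the observed near-0% utilization with short spikes.

```python
import time

def profile_loader(batches, step_fn):
    """Split wall time into loader-wait vs. training-step time.

    batches: any iterable yielding batches (e.g. a torch DataLoader).
    step_fn: callable running the forward/backward pass on one batch.
    Returns (data_wait_seconds, step_seconds).
    """
    data_wait = step_time = 0.0
    t0 = time.perf_counter()
    for batch in batches:
        t1 = time.perf_counter()
        data_wait += t1 - t0          # time blocked waiting on the loader
        step_fn(batch)
        t0 = time.perf_counter()
        step_time += t0 - t1          # time spent in the training step
    return data_wait, step_time

# Demo with a deliberately slow loader: data-wait dominates step time,
# which is the signature of a dataloading bottleneck.
def slow_loader(n, delay):
    for i in range(n):
        time.sleep(delay)             # stands in for slow CSV decode / disk I/O
        yield i

wait, step = profile_loader(slow_loader(5, 0.02), lambda b: time.sleep(0.001))
print(wait > step)  # a much larger wait means the GPU is starved for data
```

If the wait share is high, the usual knobs are more `--workers`, faster decode, or switching away from CSV to a streaming format such as webdataset.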