mlfoundations / open_clip

An open source implementation of CLIP.

Issue with Single GPU Utilization in Distributed Training using torchrun #785

Closed Lylinnnnn closed 2 months ago

Lylinnnnn commented 6 months ago

Hi!

I've encountered an issue while attempting to train a model using the torchrun script provided in the README. The script I used is as follows:

export CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node 2 -m training.main \
    --train-data "/ego4d_dealt_data/medium_train_output_0104.csv" \
    --val-data "/ego4d_dealt_data/medium_test_output_0104.csv" \
    --dataset-type "csv" \
    --batch-size 128 \
    --workers 32 \
    --lr 3.2e-6 \
    --wd 0.001 \
    --warmup 2000 \
    --epochs 1 \
    --model ViT-B-32 \
    --force-image-size 224 \
    --report-to "tensorboard" \
    --log-every-n-steps 10 \
    --lr-scheduler const

Although the world_size is set to 2 and the command line output indicates that distributed mode is running on two processes (on cuda:0 and cuda:1 respectively), I've observed through nvidia-smi that only one GPU is actively being utilized during training. The usage of the second GPU consistently remains at 0%, with occasional spikes to around 90% for just a few seconds.
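As a quick sanity check (a minimal sketch, not open_clip code): torchrun sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables in each worker process, so printing them at startup confirms that both processes launched and were assigned distinct devices:

```python
import os

def describe_worker(env=os.environ):
    """Summarize the distributed context torchrun provides via env vars.

    torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for each worker;
    LOCAL_RANK is the index typically used to pick the CUDA device.
    """
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return f"rank {rank}/{world_size} -> cuda:{local_rank}"

print(describe_worker())
```

If both workers report distinct ranks and devices, the launch itself is fine and the imbalance is more likely downstream (e.g. data loading), rather than a mis-configured distributed setup.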

Could you please assist in identifying the potential causes of this issue, and suggest any necessary adjustments to ensure both GPUs are effectively utilized in distributed training?

rwightman commented 6 months ago

Quite possibly a dataloading / efficiency problem. I wouldn't recommend CSV-based datasets. Can you compare single-GPU vs 2-GPU stats? And ignore GPU utilization %: what's the GPU power consumption? System CPU %?
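One way to test the dataloading hypothesis (a hedged sketch, assuming a hypothetical `profile_loader` helper; none of these names are from open_clip): time how long each step blocks waiting on the loader versus how long the training step itself takes. If data-wait dominates, the GPU sits idle between batches, which matches the observed near-0% utilization with short spikes.

```python
import time

def profile_loader(batches, step_fn):
    """Split wall time into loader-wait vs. training-step time.

    batches: any iterable yielding batches (e.g. a torch DataLoader).
    step_fn: callable running the forward/backward pass on one batch.
    Returns (data_wait_seconds, step_seconds).
    """
    data_wait = step_time = 0.0
    t0 = time.perf_counter()
    for batch in batches:
        t1 = time.perf_counter()
        data_wait += t1 - t0          # time blocked waiting on the loader
        step_fn(batch)
        t0 = time.perf_counter()
        step_time += t0 - t1          # time spent in the training step
    return data_wait, step_time

# Demo with a deliberately slow loader: data-wait dominates step time,
# which is the signature of a dataloading bottleneck.
def slow_loader(n, delay):
    for i in range(n):
        time.sleep(delay)             # stands in for slow CSV decode / disk I/O
        yield i

wait, step = profile_loader(slow_loader(5, 0.02), lambda b: time.sleep(0.001))
print(wait > step)  # a much larger wait means the GPU is starved for data
```

If the wait share is high, the usual knobs are more `--workers`, faster decode, or switching away from CSV to a streaming format such as webdataset.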