Closed: Lylinnnnn closed this issue 2 months ago
Quite possibly a data-loading / efficiency problem; I wouldn't recommend CSV-based datasets. Can you compare single-GPU vs. 2-GPU stats? And ignore GPU utilization %: what's the GPU power consumption? What's the system CPU usage?
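One way to test the data-loading hypothesis is to time how long each iteration waits on the dataloader versus how long it spends in forward/backward. The sketch below uses a placeholder dataset and model (not the repo's); substitute the real ones:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset/model; swap in the real CSV-backed dataset to measure it.
dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=0)
model = torch.nn.Linear(8, 2)

data_time, step_time = 0.0, 0.0
end = time.perf_counter()
for x, y in loader:
    data_time += time.perf_counter() - end  # time spent waiting for the next batch
    start = time.perf_counter()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    model.zero_grad()
    step_time += time.perf_counter() - start  # time spent in forward/backward
    end = time.perf_counter()

print(f"data wait: {data_time:.4f}s  compute: {step_time:.4f}s")
```

If the data-wait total dominates, the GPUs are starving on input: raising `num_workers`, enabling `pin_memory=True`, or converting the CSV files to a binary/tensor format usually helps.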
Hi!
I've encountered an issue while attempting to train a model using the torchrun script provided in the README. The script I used is as follows:
Although world_size is set to 2 and the command-line output indicates distributed mode running on two processes (cuda:0 and cuda:1 respectively), nvidia-smi shows that only one GPU is actively utilized during training. The second GPU's utilization stays at 0%, with occasional spikes to around 90% for just a few seconds.
Could you please assist in identifying the potential causes of this issue, and suggest any necessary adjustments to ensure both GPUs are effectively utilized in distributed training?
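For reference, here is a minimal sketch of the DDP pieces whose absence commonly produces this symptom (one busy GPU, one idle). The dataset, model, and hyperparameters are placeholders, not the repo's code; the key parts are the `DistributedSampler` (without it, every rank reads the full dataset) and pinning each process to its `LOCAL_RANK` device:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder data/model; substitute the real ones.
    dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))
    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    model = DDP(torch.nn.Linear(8, 2).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            opt.zero_grad()
            torch.nn.functional.cross_entropy(model(x), y).backward()
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__" and "RANK" in os.environ:
    main()  # runs only when launched via torchrun
```

Launched as `torchrun --nproc_per_node=2 train.py`, both processes should then show sustained power draw in nvidia-smi; if one still idles, the bottleneck is likely upstream (data loading) rather than in the DDP wiring.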