open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Distributed Training #11473

Open riley-ball opened 3 months ago

riley-ball commented 3 months ago

Hi, I'm wondering whether distributed training works the way I think it does, i.e. GPU VRAM is pooled across all available GPUs, enabling larger batch sizes, higher-resolution training images, etc. I am currently training on RunPod, which offers a variety of GPUs, and I've found that I'm capped by the VRAM of a single GPU (1x A100), which led me to try leveraging multiple GPUs. However, whether it's an issue with RunPod or with my understanding of MMDistributedDataParallel, it seems that it doesn't create a shared memory pool but rather just splits the work across multiple GPUs for faster training, as in the sketch below.
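For reference, here is a minimal sketch of how I understand plain PyTorch `DistributedDataParallel` (which `MMDistributedDataParallel` wraps) handles a batch. This is not MMDetection's actual code; the model, batch size, and port are made up for illustration:

```python
# Sketch: DDP replicates the FULL model on every GPU and gives each rank a
# SLICE of the global batch. Per-GPU memory is NOT pooled, so the largest
# model/image you can fit is still bounded by a single GPU's VRAM.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # arbitrary free port for this sketch
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Every rank holds a full copy of the model on its own GPU.
    model = torch.nn.Linear(1024, 1024).cuda(rank)
    model = DDP(model, device_ids=[rank])

    # Each rank sees only its shard of the global batch: with a global
    # batch of 32 and 4 GPUs, every GPU processes 8 samples.
    global_batch = 32
    per_gpu = global_batch // world_size
    x = torch.randn(per_gpu, 1024, device=rank)
    loss = model(x).sum()
    loss.backward()  # gradients are all-reduced across ranks here

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

So, as far as I can tell, adding GPUs grows the effective (global) batch size and speeds up training, but it doesn't let a single sample or a larger model exceed one GPU's VRAM; that would need model/parameter sharding (e.g. FSDP) rather than plain data parallelism. Is that correct? (For completeness, I'm launching via the standard `bash ./tools/dist_train.sh CONFIG_FILE NUM_GPUS` script.)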

Any help would be appreciated!

riley-ball commented 3 months ago

@chhluo Sorry for pinging you, but it seems @hhaAndroid hasn't been checking the issues. Do you think you could provide an answer?