Hi, I'm wondering whether distributed training works the way I think it does: is GPU VRAM pooled across all available GPUs, enabling larger batch sizes, higher-resolution training images, etc.? I'm currently training on RunPod, which offers a variety of GPUs, but I've found I'm capped on VRAM when using a single GPU (1x A100), which led me to try leveraging multiple GPUs. However, whether it's an issue with RunPod or with my understanding of MMDistributedDataParallel, it seems it's not creating a shared memory pool but rather just splitting the batch across multiple GPUs for faster training.
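
For reference, my understanding is that MMDistributedDataParallel is a thin wrapper around PyTorch's DistributedDataParallel, so here's a minimal plain-PyTorch sketch (a hypothetical toy model, not my actual training script) of what I believe is happening: each rank builds a full replica of the model on its own GPU, so per-GPU memory usage stays the same as single-GPU training and only the batch gets split:

```python
# Minimal DDP sketch (assumes a 2-GPU node): each rank holds a *full* copy of
# the model, so VRAM is not pooled -- only the global batch is split across ranks.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Every rank builds the complete model on its own GPU, so per-GPU memory
    # is the same as single-GPU training, just with a smaller per-rank batch.
    model = torch.nn.Linear(4096, 4096).cuda(rank)
    model = DDP(model, device_ids=[rank])
    x = torch.randn(8, 4096, device=rank)  # this rank's slice of the global batch
    model(x).sum().backward()              # gradients are all-reduced across ranks
    print(f"rank {rank}: {torch.cuda.memory_allocated(rank) / 2**20:.0f} MiB")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

If that's right, then adding GPUs only buys throughput, not headroom, and I'd need something like model/tensor parallelism or sharded training to actually fit larger batches or higher resolutions.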
Any help would be appreciated!