microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Why doesn't deepspeed stage 3 allow a batch size of 1 with multiple GPUs? #5645

Open AceMcAwesome77 opened 2 weeks ago

AceMcAwesome77 commented 2 weeks ago

I have trained an encoder-decoder model with PyTorch Lightning on a single GPU. With a batch size of 1, the input image size maxes out at around 960x960; anything larger gives OOM errors during training. I would like to reach a larger input size by using DeepSpeed stage 3 to split the model weights across 2 GPUs. When I do this, the logs show that the model weights are indeed partitioned across the 2 GPUs, but then 2 training samples are processed at once during the train step, and 2 validation samples at once during the val step, even though my batch size is set to 1. This seems to defeat the point of splitting the model weights across 2 GPUs: if you shard the weights but then double the effective batch size, you end up with the same memory limitation you started with. I would expect to process only 1 training sample at a time in order to benefit from the larger allowable input image size.
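
For reference, here is roughly how I'm launching training (a minimal sketch; my actual LightningModule and DataLoader code is omitted, and the exact DeepSpeedStrategy arguments may differ slightly from what I have):

```python
# Minimal sketch of my launch code (model/dataloader definitions omitted).
# Assumes PyTorch Lightning's built-in DeepSpeedStrategy.
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                            # shard the model weights across 2 GPUs
    strategy=DeepSpeedStrategy(stage=3),  # ZeRO stage 3
)

# trainer.fit(my_lightning_module, datamodule=my_datamodule)
# (hypothetical names; the DataLoaders inside are built with batch_size=1)
```

Even with batch_size=1 in the DataLoader, each train/val step processes one sample on each GPU, i.e. 2 samples per step in total.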

In these docs (https://www.deepspeed.ai/docs/config-json/) I see the following:

"Note: train_batch_size must be equal to train_micro_batch_size_per_gpu gradient_accumulation_steps number of GPUs."

Since I have 2 GPUs, and I assume that train_micro_batch_size_per_gpu and gradient_accumulation_steps must each be integers of at least 1, my minimum possible train_batch_size is 2. However, as I said, I don't want that, because it defeats the point of splitting the model weights across the 2 GPUs in the first place! I would think a train_batch_size of 1 is what I need to avoid OOM at resolutions above 960x960.
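
Concretely, the smallest configuration I can see that satisfies that identity on 2 GPUs looks like this (a sketch, assuming both values must be at least 1; written as the Python dict I would pass as the DeepSpeed config):

```python
# Smallest batch-size settings that satisfy
#   train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
# with 2 GPUs, as far as I can tell.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # 1 sample per GPU per step
    "gradient_accumulation_steps": 1,
    "train_batch_size": 2,                # forced to 1 * 1 * 2 = 2
    "zero_optimization": {"stage": 3},
}
```

So even in the smallest legal configuration, every step still puts one sample on each GPU, which is exactly the behaviour I'm seeing.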

Am I missing something here? Could someone please explain why DeepSpeed stage 3 requires a total batch size >= the number of GPUs, when that seems to defeat the purpose of using stage 3 to split the model weights across GPUs in the first place?

Thanks!