I'm running into trouble implementing PPO with DeepSpeed, which requires running the actor, critic, and reward model at the same time.
It seems that DeepSpeed does not handle running multiple models simultaneously well. I keep hitting errors like this:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate xxx MiB
Is there any functionality that supports this kind of scenario? For example, can we specify which GPUs each model uses, so that placing different models on different GPUs avoids the memory allocation issue?
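To illustrate what I mean, here is a minimal plain-PyTorch sketch of the kind of per-model device placement I have in mind (this is not a DeepSpeed API; the tiny `Linear` models and the `pick_device` helper are just placeholders for illustration):

```python
import torch

def pick_device(index: int) -> torch.device:
    # Hypothetical helper: map each model to its own GPU when one is
    # available, falling back to CPU otherwise.
    if torch.cuda.is_available() and index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")

# Place each PPO component on a different device so they do not
# compete for the same GPU's memory. Real models would replace these
# placeholder Linear layers.
actor = torch.nn.Linear(8, 8).to(pick_device(0))
critic = torch.nn.Linear(8, 1).to(pick_device(1))
reward_model = torch.nn.Linear(8, 1).to(pick_device(2))

x = torch.randn(2, 8).to(pick_device(0))
out = actor(x)  # runs on whichever device the actor was placed on
```

Something like this per-model placement, but integrated with DeepSpeed's engine initialization, is what I'm looking for.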
Thanks!