microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Support multiple models using deepspeed #3093

Open zhzou2020 opened 1 year ago

zhzou2020 commented 1 year ago

I'm frustrated when trying to implement PPO with DeepSpeed, which requires running the actor, critic, and reward models at the same time.

It seems that DeepSpeed does not handle running multiple models at the same time well. I keep hitting errors like: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate xxx MiB

I don't know if there is any functionality that supports this kind of scenario. For example, can we specify which GPUs each model uses, so that placing different models on different GPUs avoids the memory allocation issues?
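For concreteness, here is a rough sketch of the kind of per-model placement I have in mind. The models and sizes are toy stand-ins and it assumes three visible GPUs; it is not my actual setup:

```python
# Hypothetical sketch: pinning each model to its own GPU with plain PyTorch,
# so their weights, activations, and optimizer state do not compete for the
# same device's memory. Assumes at least three CUDA devices are available.
import torch
import torch.nn as nn

# Toy stand-ins for the actor, critic, and reward models.
actor = nn.Linear(1024, 1024).to("cuda:0")
critic = nn.Linear(1024, 1).to("cuda:1")
reward_model = nn.Linear(1024, 1).to("cuda:2")

# During a PPO step, move the batch to each model's device before its forward pass.
batch = torch.randn(8, 1024)
logits = actor(batch.to("cuda:0"))
value = critic(batch.to("cuda:1"))
reward = reward_model(batch.to("cuda:2"))
```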

Thanks!

tjruwase commented 1 year ago

Please see our recent DeepSpeed Chat release: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat
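Roughly, the approach there is to give each model its own DeepSpeed engine via a separate `deepspeed.initialize()` call, so each model's states are managed (and, with ZeRO, partitioned) independently. Below is a minimal sketch of that pattern; the toy models, config values, and learning rate are illustrative assumptions, not taken from the DeepSpeed Chat code, and it should be run under the DeepSpeed launcher (e.g. `deepspeed script.py`):

```python
# Sketch: one DeepSpeed engine per model, assuming toy nn.Linear models and
# illustrative config values. Trainable models get an optimizer + ZeRO; the
# reward model is wrapped for inference only, without an optimizer.
import deepspeed
import torch.nn as nn

train_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}
eval_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
}

actor = nn.Linear(1024, 1024)
critic = nn.Linear(1024, 1)
reward_model = nn.Linear(1024, 1)

# Each initialize() call returns (engine, optimizer, dataloader, lr_scheduler).
actor_engine, actor_opt, _, _ = deepspeed.initialize(
    model=actor, model_parameters=actor.parameters(), config=train_config
)
critic_engine, critic_opt, _, _ = deepspeed.initialize(
    model=critic, model_parameters=critic.parameters(), config=train_config
)
reward_engine, _, _, _ = deepspeed.initialize(
    model=reward_model, config=eval_config
)
```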