huggingface / speech-to-speech

Speech To Speech: an effort for an open-sourced and modular GPT4-o
Apache License 2.0

Unable to use multiple GPUs – CUDA Out of Memory issue #105

Open Devloper-RG opened 2 weeks ago

Devloper-RG commented 2 weeks ago

While using other models such as meta-llama/Meta-Llama-3.1-8B-Instruct, I'm encountering a torch.OutOfMemoryError when trying to load the model across multiple GPUs. I have 4 GPUs, each with 14.57 GiB of memory, but the model fails to allocate memory on GPU 0, even though the other GPUs should share the load.
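For reference, a rough back-of-envelope estimate (parameter count approximate, not a measured value) of why the weights alone cannot fit on one of these GPUs even in fp16:

```python
# Rough estimate only: ~8e9 parameters at 2 bytes each (fp16),
# before counting activations, the KV cache, or the CUDA context.
params = 8.0e9
bytes_fp16 = params * 2
print(f"fp16 weights alone: {bytes_fp16 / 1024**3:.1f} GiB")  # ~14.9 GiB > 14.57 GiB per GPU
```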

eustlb commented 2 weeks ago

Hey @Devloper-RG, thanks for raising this issue and testing the lib in a multi-GPU setup 🙏 I'd be glad to help with that; can you provide a reproducer?

andimarafioti commented 2 weeks ago

I guess the issue here is that we are pushing the model to `cuda` as the device.
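In other words (a minimal illustration, not the library's actual loading code): moving the model with a plain `cuda` device puts every layer on `cuda:0`, whereas letting accelerate pick a `device_map` shards the layers across all visible GPUs.

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# What "pushing to cuda" amounts to: the whole model lands on cuda:0
# and must fit on that single GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

# Multi-GPU alternative (requires `accelerate`): layers are sharded
# across all visible GPUs, so each one only holds a slice of the model.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
```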

Devloper-RG commented 2 weeks ago

Hey @eustlb , thanks for getting back to me!

I made some modifications to the code to use the meta-llama/Meta-Llama-3.1-8B-Instruct model by updating the arguments_classes/language_model_arguments.py script, and I also adjusted the LLM/language_model.py script so the model can be loaded from the Hugging Face Hub.

Thereafter I ran the server on a Google Cloud Platform (GCP) VM with 2 NVIDIA T4 GPUs. During testing, I noticed that one of the GPUs consistently overloads, leading to a torch.OutOfMemoryError.

I tried using the DataParallel method, but it didn't resolve the issue. I also attempted to run the model in lower precision, which worked on a single GPU, but I'd like to use higher-precision models and fully leverage multiple GPUs for better performance.
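For what it's worth, DataParallel replicates the full model on every GPU, so it doesn't reduce per-GPU memory; what I'm after is sharding the model across the GPUs. A sketch of what I'd like to achieve, assuming the loader accepts transformers/accelerate-style kwargs (the per-GPU memory caps below are illustrative values I haven't validated):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Shard the model across both T4s, capping how much each GPU may hold so
# that GPU 0 keeps headroom for activations and the KV cache.
# Note: nn.DataParallel would instead copy the *whole* model onto each GPU,
# which is why it did not help with the OOM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "10GiB", 1: "13GiB", "cpu": "30GiB"},  # illustrative caps
)
print(model.hf_device_map)  # shows which layers landed on which device
```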

Any help with getting multi-GPU support working would be greatly appreciated!