michaelfeil / hf-hub-ctranslate2

Connecting Transformers on HuggingFace Hub with CTranslate2
https://michaelfeil.github.io/hf-hub-ctranslate2/
MIT License

Multi-GPU support for GeneratorCT2fromHfHub Falcon-40B-instruct #11

Open aniket7joshi opened 1 year ago

aniket7joshi commented 1 year ago

I am encountering a RuntimeError: CUDA failed with error out of memory while attempting to load the Falcon-40B-instruct model using the GeneratorCT2fromHfHub module on GPU. Upon inspecting the GPU usage with nvidia-smi, I noticed that only one GPU is utilizing all the memory, while the other GPUs remain unused.

I have reviewed the code but couldn't find any indications of multi-GPU support. Could you please confirm if multi-GPU support has been implemented, and if I missed something in the code? Alternatively, is multi-GPU support planned for future sprints?

I have attached a screenshot of the GPU memory usage for your reference.

[Screenshot: nvidia-smi GPU memory usage, 2023-07-10]
michaelfeil commented 1 year ago

This library just wraps downloading the model from the HF Hub, the tokenizer, and CTranslate2 internally.

This looks to me like a feature request for CTranslate2, which has an open issue for distributed inference: https://github.com/OpenNMT/CTranslate2/issues/1052

Multiple GPUs are supported if you specify multiple GPU device indices, but each GPU would then need to hold a full copy of the model on its own.
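As a minimal sketch of that last point: assuming the wrapper forwards extra keyword arguments such as `device_index` to the underlying `ctranslate2.Generator` (and using a hypothetical converted-model repo name), loading replicas on several GPUs could look like this. Note this replicates, not shards, the 40B model, so each GPU still needs enough memory for a full copy.

```python
# Hypothetical sketch: assumes GeneratorCT2fromHfHub forwards keyword
# arguments like device_index to ctranslate2.Generator. Each listed GPU
# holds a FULL replica of the model; this does not shard the weights.

def multi_gpu_kwargs(device_indices):
    """Build load-time kwargs for one model replica per listed GPU."""
    return {
        "device": "cuda",
        "device_index": list(device_indices),  # e.g. [0, 1, 2, 3]
        "compute_type": "int8_float16",        # quantize to shrink each replica
    }

kwargs = multi_gpu_kwargs([0, 1, 2, 3])

# Usage (requires CUDA GPUs and a converted model; repo name is illustrative):
# from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
# model = GeneratorCT2fromHfHub(
#     model_name_or_path="michaelfeil/ct2fast-falcon-40b-instruct", **kwargs
# )
```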