Open aniket7joshi opened 1 year ago
This library just wraps downloading of the model from the HF Hub, the tokenizer, and CTranslate2 internally.
This seems like a feature request for CTranslate2, where there is an open issue for distributed inference: https://github.com/OpenNMT/CTranslate2/issues/1052
Multiple GPUs are supported if you specify multiple GPU device indices, but each GPU is then required to hold its own full copy of the model.
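To illustrate, here is a minimal sketch of what that looks like at the CTranslate2 level, assuming the wrapper forwards the device kwargs unchanged to ctranslate2.Generator (the helper names here are hypothetical, not part of the library):

```python
# Assumption: passing device_index=[0, 1] to ctranslate2.Generator places a
# FULL replica of the model on each listed GPU (data parallelism for
# throughput). It does NOT shard one large model across them, so a model
# like Falcon-40B must still fit in a single GPU's memory.

def replica_kwargs(gpu_indices):
    """Build the device kwargs that replicate the model on each listed GPU."""
    return {"device": "cuda", "device_index": list(gpu_indices)}

def load_replicated(model_dir, gpu_indices):
    """Load one full model copy per GPU index (requires CUDA + a converted model)."""
    import ctranslate2  # imported lazily; only needed when actually loading
    return ctranslate2.Generator(model_dir, **replica_kwargs(gpu_indices))
```

So specifying device_index=[0, 1, 2, 3] spreads concurrent requests across the GPUs, but does not reduce the per-GPU memory requirement.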
I am encountering a "RuntimeError: CUDA failed with error out of memory" while attempting to load the Falcon-40B-instruct model on GPU using the GeneratorCT2fromHfHub module. Upon inspecting GPU usage with nvidia-smi, I noticed that a single GPU is using all of its memory while the other GPUs remain idle.
I have reviewed the code but couldn't find any indication of multi-GPU support. Could you please confirm whether multi-GPU support has been implemented and I missed it, or whether it is planned for future sprints?
I have attached a screenshot of the GPU memory usage for your reference.