Is there a library I could use to distribute model loading between the GPU and CPU? I have a GPU with 16 GB of memory and tried https://huggingface.co/blog/assisted-generation. Models up to 1.3B parameters work fine, but models with 6.7B parameters and beyond fail to load because of the memory required. Is there a library I could use to share the load between the CPU and GPU?
Check this doc from the Accelerate library. You can use big model inference directly by passing `device_map` to `from_pretrained` if you are using the Transformers library!
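A minimal sketch of what that looks like, assuming `accelerate` is installed alongside `transformers` and using `facebook/opt-6.7b` as a stand-in for whichever 6.7B checkpoint you tried:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"  # assumption: substitute the checkpoint you want

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets Accelerate place as many layers as fit on the GPU
# and offload the rest to CPU RAM (and disk, if even that runs out).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",  # load in the checkpoint's saved precision instead of fp32
)

# Inputs go to the GPU, where the first layers are placed.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If you want to cap GPU usage below the full 16 GB (to leave headroom for activations), you can also pass `max_memory`, e.g. `max_memory={0: "14GiB", "cpu": "30GiB"}`, and Accelerate will respect those limits when splitting the model. Note that offloaded layers run much slower than GPU-resident ones.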