elephantpanda opened this issue 1 year ago
Describe the feature request
As per this Hugging Face article, there is a way to load very large models on a limited GPU by dividing the model into several pieces and swapping them between RAM and VRAM.
I'm not sure if this is already possible or being worked on, but it seems like a very sensible feature to incorporate into OnnxRuntime.
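For context, here is a minimal sketch of what the Hugging Face / `accelerate` offloading looks like from the user's side; the model name and memory budgets below are just placeholders, not part of this request:

```python
# Sketch of the Hugging Face / accelerate behaviour being described:
# weights that don't fit the per-device budget stay in CPU RAM (or on disk)
# and are moved onto the GPU as each layer executes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # placeholder; any large checkpoint works

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                          # let accelerate split layers across devices
    max_memory={0: "14GiB", "cpu": "30GiB"},    # cap GPU use, spill the rest to RAM
    offload_folder="offload",                   # optional disk offload for what still doesn't fit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```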
Describe scenario use case
This would be useful if, say, you have a large 10GB model but only 16GB of VRAM.
If you also had 32GB of RAM, the runtime could make use of it so you could load bigger models.
Here is some more research on this topic from Stanford which may be useful: FlexGen.
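The closest workaround I can see today is splitting the ONNX graph manually and running the pieces on different devices, which is static partitioning rather than the dynamic RAM/VRAM swapping described above. A rough sketch, assuming the graph has a clean cut point (all file and tensor names below are made up for illustration):

```python
# Split the graph at a known intermediate tensor and run the halves on
# different devices. "big_model.onnx", "hidden_states_12", "input_ids" and
# "logits" are hypothetical names for illustration only.
import numpy as np
import onnx.utils
import onnxruntime as ort

onnx.utils.extract_model("big_model.onnx", "part1.onnx",
                         input_names=["input_ids"], output_names=["hidden_states_12"])
onnx.utils.extract_model("big_model.onnx", "part2.onnx",
                         input_names=["hidden_states_12"], output_names=["logits"])

# First half on the GPU (with a memory cap), second half on the CPU.
gpu_sess = ort.InferenceSession(
    "part1.onnx",
    providers=[("CUDAExecutionProvider", {"gpu_mem_limit": 8 * 1024**3}),
               "CPUExecutionProvider"])
cpu_sess = ort.InferenceSession("part2.onnx", providers=["CPUExecutionProvider"])

input_ids = np.zeros((1, 16), dtype=np.int64)                  # dummy input
hidden = gpu_sess.run(None, {"input_ids": input_ids})[0]       # GPU half
logits = cpu_sess.run(None, {"hidden_states_12": hidden})[0]   # CPU half
```

This only places fixed parts of the model on fixed devices; having OnnxRuntime itself page weights between RAM and VRAM during execution is the feature being requested.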