microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Feature Request] CPU RAM Offloading for large models #14831

Open elephantpanda opened 1 year ago

elephantpanda commented 1 year ago

Describe the feature request

As per this Hugging Face article:

It describes a way to load very large models on a GPU with limited VRAM by dividing the model into several pieces and swapping them between CPU RAM and GPU VRAM.

I'm not sure if this is already possible or being worked on, but it seems like a very sensible feature to incorporate into ONNX Runtime.
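To make the idea concrete, here is a rough sketch of how something like this can be approximated by hand today with the existing Python APIs: split the graph with `onnx.utils.extract_model` and keep only one partition's session (and hence its weights) on the GPU at a time, passing the intermediate tensor through host memory. The model file, tensor names, and shapes below are hypothetical placeholders; a built-in offloading feature would presumably do this partitioning and scheduling automatically inside the runtime.

```python
# Minimal sketch (not an existing ONNX Runtime feature): manually split a
# model into two sub-graphs and run them one after another, so only one
# partition's weights occupy VRAM at any point in time.
# "big_model.onnx", "input_ids", "hidden_states", and "logits" are
# hypothetical names chosen for illustration.
import numpy as np
import onnx.utils
import onnxruntime as ort

MODEL = "big_model.onnx"   # hypothetical model file
CUT = "hidden_states"      # hypothetical boundary tensor inside the graph

# Split the graph at an intermediate tensor; both halves are written to disk.
onnx.utils.extract_model(MODEL, "part1.onnx",
                         input_names=["input_ids"], output_names=[CUT])
onnx.utils.extract_model(MODEL, "part2.onnx",
                         input_names=[CUT], output_names=["logits"])

def run_partition(path, feeds):
    """Load one partition onto the GPU, run it, then release its VRAM."""
    sess = ort.InferenceSession(
        path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    outputs = sess.run(None, feeds)
    del sess  # drop the session so this partition's weights can be freed
    return outputs

input_ids = np.zeros((1, 128), dtype=np.int64)  # hypothetical input
(hidden,) = run_partition("part1.onnx", {"input_ids": input_ids})
(logits,) = run_partition("part2.onnx", {CUT: hidden})
```

The obvious downside of doing it by hand is that each partition's weights are re-uploaded on every call; a native feature could pin the weights in CPU RAM and stream them to the GPU asynchronously instead.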

Describe scenario use case

This might be useful if, say, you have a large 10GB model but only 16GB of VRAM.

If you also had 32GB of RAM, this feature would make use of that memory so you could load bigger models.

elephantpanda commented 1 year ago

Here is some more research on this topic from Stanford which may be useful:

FlexGen