Open shixianc opened 10 months ago
Hello,
I also think implementing this would be valuable.
It is worth noting that a demo of this offloading technique already exists: https://github.com/dvmazur/mixtral-offloading/blob/master/notebooks/demo.ipynb
Quote: "One will need approximately 16 GB of VRAM and 11 GB of RAM to run this notebook and generate somewhat long texts."
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
There's a new caching technique described in the paper https://arxiv.org/abs/2312.17238 (GitHub: https://github.com/dvmazur/mixtral-offloading). They introduce an LRU cache for experts, based on activation patterns they observed, and also speculatively pre-load experts before the next layer's computation. The results look quite promising. Could we support this for Mixtral? It would help a lot when running on smaller GPUs.
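To make the idea concrete, here is a minimal, untested sketch (plain PyTorch; the class and helper names are made up for illustration and this is not the paper's or vLLM's actual code) of how an LRU expert cache with speculative prefetch could look:

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class ExpertLRUCache:
    """Toy LRU cache that keeps at most `capacity` experts on the GPU.

    Experts not in the cache live in CPU RAM; on a miss, the least-recently-used
    resident expert is offloaded back to CPU before the requested one is copied in.
    """

    def __init__(self, cpu_experts: dict, capacity: int, device: str = "cuda"):
        self.cpu_experts = cpu_experts        # expert_id -> nn.Module (on CPU)
        self.capacity = capacity
        self.device = device
        self.gpu_experts = OrderedDict()      # expert_id -> nn.Module (on GPU), in LRU order

    def get(self, expert_id):
        if expert_id in self.gpu_experts:
            # Cache hit: mark this expert as most recently used.
            self.gpu_experts.move_to_end(expert_id)
            return self.gpu_experts[expert_id]
        # Cache miss: evict the least recently used expert if the cache is full.
        if len(self.gpu_experts) >= self.capacity:
            evicted_id, evicted = self.gpu_experts.popitem(last=False)
            self.cpu_experts[evicted_id] = evicted.to("cpu")
        expert = self.cpu_experts.pop(expert_id).to(self.device)
        self.gpu_experts[expert_id] = expert
        return expert

    def prefetch(self, expert_ids):
        """Speculatively pull in experts the router is guessed to pick next."""
        for expert_id in expert_ids:
            self.get(expert_id)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Eight tiny stand-in experts; a real Mixtral expert is a gated MLP.
    experts = {i: nn.Linear(16, 16) for i in range(8)}
    cache = ExpertLRUCache(experts, capacity=2, device=device)

    x = torch.randn(1, 16, device=device)
    for expert_id in [0, 3, 0, 5]:            # simulated router choices
        y = cache.get(expert_id)(x)
    cache.prefetch([1, 4])                    # speculative guess for the next layer
    print(sorted(cache.gpu_experts.keys()))   # -> the two most recently touched experts
```

The real implementation would presumably also overlap the host-to-device copies with compute and drive `prefetch` from the current layer's gating scores, but something along these lines could be hooked into the Mixtral expert lookup.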