Open shixianc opened 10 months ago
Hello,
I also think implementing this would be valuable.
It is worth noting that a demo of this offloading technique already exists: https://github.com/dvmazur/mixtral-offloading/blob/master/notebooks/demo.ipynb
Quote: "One will need approximately 16 GB of VRAM and 11 GB of RAM to run this notebook and generate somewhat long texts."
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
There's a new caching technique described in the paper https://arxiv.org/abs/2312.17238 (GitHub: https://github.com/dvmazur/mixtral-offloading). They introduce an LRU cache for experts, based on activation patterns they observed, and also speculatively pre-load experts before the next layer's computation. The results look quite promising. Could we support this for Mixtral? It would help a lot when running on smaller GPUs.
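To make the idea concrete, here is a minimal, untested sketch (plain PyTorch; the class and helper names are made up for illustration and this is not the paper's or vLLM's actual code) of how an LRU expert cache with speculative prefetch could look:

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class ExpertLRUCache:
    """Toy LRU cache that keeps at most `capacity` experts on the GPU.

    Experts not in the cache live in CPU RAM; on a miss, the least-recently-used
    resident expert is offloaded back to CPU before the requested one is copied in.
    """

    def __init__(self, cpu_experts: dict, capacity: int, device: str = "cuda"):
        self.cpu_experts = cpu_experts        # expert_id -> nn.Module (on CPU)
        self.capacity = capacity
        self.device = device
        self.gpu_experts = OrderedDict()      # expert_id -> nn.Module (on GPU), in LRU order

    def get(self, expert_id):
        if expert_id in self.gpu_experts:
            # Cache hit: mark this expert as most recently used.
            self.gpu_experts.move_to_end(expert_id)
            return self.gpu_experts[expert_id]
        # Cache miss: evict the least recently used expert if the cache is full.
        if len(self.gpu_experts) >= self.capacity:
            evicted_id, evicted = self.gpu_experts.popitem(last=False)
            self.cpu_experts[evicted_id] = evicted.to("cpu")
        expert = self.cpu_experts.pop(expert_id).to(self.device)
        self.gpu_experts[expert_id] = expert
        return expert

    def prefetch(self, expert_ids):
        """Speculatively pull in experts the router is guessed to pick next."""
        for expert_id in expert_ids:
            self.get(expert_id)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Eight tiny stand-in experts; a real Mixtral expert is a gated MLP.
    experts = {i: nn.Linear(16, 16) for i in range(8)}
    cache = ExpertLRUCache(experts, capacity=2, device=device)

    x = torch.randn(1, 16, device=device)
    for expert_id in [0, 3, 0, 5]:            # simulated router choices
        y = cache.get(expert_id)(x)
    cache.prefetch([1, 4])                    # speculative guess for the next layer
    print(sorted(cache.gpu_experts.keys()))   # -> the two most recently touched experts
```

The real implementation would presumably also overlap the host-to-device copies with compute and drive `prefetch` from the current layer's gating scores, but something along these lines could be hooked into the Mixtral expert lookup.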