ml-explore / mlx-examples

[Feature Request] MLX_lm: Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #917

Closed mark-lord closed 1 month ago

mark-lord commented 2 months ago

It would be great if MLX_lm supported a --cache_prompt flag like the one in llama.cpp (link to their discussion + eventual PR).

This would be a big win for latency / start-up time on repeated runs that include the same prompt, e.g. chatbot applications with a running chat log, or long multishot examples, which take up a lot of tokens (and are very useful in production environments).

I'm no expert, so I likely won't be able to help with a PR for this :'( But looking at the discussion around the original llama.cpp implementation, I have a couple of observations:

The default behaviour of a --cache-prompt flag could be to save the KV cache to the model folder, as is done with trained adapters; it would then be useful to have additional flags to clear or delete the cache afterwards (a rough sketch of the intended workflow is below).
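
Roughly, the desired workflow is: pay the prefill cost for the shared prefix once, write the resulting KV cache to disk, and reload it in later runs so only the new suffix has to be processed. A minimal sketch of the "compute once and save" half, assuming the prompt-cache helpers that recent mlx_lm versions expose (make_prompt_cache / save_prompt_cache in mlx_lm.models.cache; names, module paths, and signatures may vary by version, and the model path is just an example):

```python
import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache, save_prompt_cache

# Example model; any mlx_lm-compatible model should work.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# The long shared prefix (system prompt, running chat log, multishot examples, ...).
prefix = open("multishot_examples.txt").read()
tokens = mx.array(tokenizer.encode(prefix))[None]

# Pay the prefill cost once: running the model over the prefix populates the
# per-layer KV cache, which can then be persisted to disk for later runs.
prompt_cache = make_prompt_cache(model)
logits = model(tokens, cache=prompt_cache)
mx.eval(logits)  # force evaluation so the cached keys/values are materialized

save_prompt_cache("prefix_cache.safetensors", prompt_cache)
```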

Jckwind commented 2 months ago

Take a look at this open source library / repo: https://github.com/otriscon/llm-structured-output. They have an implementation of a reusable KV cache for MLX. I've gotten it working, and it works surprisingly well!
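
For context, the core idea of a reusable KV cache (shown here with mlx_lm's own primitives rather than that repo's API, which may differ) is to keep one cache object alive across turns so each call only prefills the new tokens. A rough sketch, assuming a recent mlx_lm version whose generate() accepts a prompt_cache argument:

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# One cache object for the whole conversation: every generate() call appends
# to it, so follow-up turns skip re-processing the earlier tokens.
chat_cache = make_prompt_cache(model)

for turn in ["Hello! Please introduce yourself.",
             "Now summarize that in one sentence."]:
    reply = generate(model, tokenizer, prompt=turn,
                     prompt_cache=chat_cache, max_tokens=128)
    print(reply)
```

This is the in-process version of the idea; the feature request above is about persisting that same cache across separate runs.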

awni commented 1 month ago

I think we can close this! 🚀

Documentation on usage.
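
For reference, the shipped workflow is roughly: compute and save a prompt cache once (e.g. with the mlx_lm.cache_prompt command), then reload it for follow-up generations. A minimal sketch of the reload side, assuming the load_prompt_cache helper in mlx_lm.models.cache and a generate() that accepts prompt_cache (exact flags and signatures are best checked against the linked documentation):

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import load_prompt_cache

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# Reload the KV cache saved by an earlier run (e.g. produced by the
# mlx_lm.cache_prompt command); only the new prompt suffix is prefilled,
# so start-up latency drops for long shared prefixes.
prompt_cache = load_prompt_cache("prefix_cache.safetensors")

print(generate(model, tokenizer, prompt="New user question goes here",
               prompt_cache=prompt_cache, max_tokens=256))
```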