Closed mark-lord closed 1 month ago
Take a look at this open-source repo: https://github.com/otriscon/llm-structured-output. It includes an implementation of a reusable KV cache for MLX. I've gotten it working, and it works surprisingly well!
I think we can close this! 🚀
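For anyone landing here, the core idea of a reusable KV cache is: when a new prompt shares a leading prefix with one already processed, only the tokens after the shared prefix need a forward pass. Here's a toy sketch of that bookkeeping. Note this is not the actual mlx or llm-structured-output API; `ToyKVCache`, `reuse_prefix`, and `process_prompt` are invented names for illustration, and the "KV" entries are stand-in strings rather than real attention tensors.

```python
# Toy sketch of prompt / KV-cache reuse. All names are hypothetical;
# this only illustrates the prefix-matching bookkeeping, not mlx itself.

class ToyKVCache:
    """Holds per-token "KV" state for an already-processed prompt prefix."""

    def __init__(self):
        self.tokens = []  # tokens whose state has been computed
        self.state = []   # one (fake) KV entry per token

    def reuse_prefix(self, prompt_tokens):
        """Return how many leading tokens match the cached prefix,
        trimming any cache entries past the point of divergence."""
        n = 0
        for cached, new in zip(self.tokens, prompt_tokens):
            if cached != new:
                break
            n += 1
        self.tokens = self.tokens[:n]
        self.state = self.state[:n]
        return n


def process_prompt(cache, prompt_tokens):
    """Only tokens after the shared prefix get a (fake) forward pass.
    Returns the number of tokens actually computed this call."""
    start = cache.reuse_prefix(prompt_tokens)
    for tok in prompt_tokens[start:]:
        cache.tokens.append(tok)
        cache.state.append(f"kv({tok})")  # stand-in for real K/V tensors
    return len(prompt_tokens) - start


cache = ToyKVCache()
cold = process_prompt(cache, ["sys", "a", "b"])        # cold start: 3 computed
warm = process_prompt(cache, ["sys", "a", "b", "c"])   # warm: only 1 computed
```

This is exactly why the savings are large for a running chat log: each turn re-sends the whole conversation, but only the newest tokens miss the cache.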
It would be great if MLX_lm supported a --cache_prompt flag like in llama.cpp's integration (link to their discussion + eventual PR).
This would be a big benefit in reducing latency and start-up time for repeated runs that share the same prompt prefix, e.g. chatbot applications with a running chat log, or long multi-shot examples, since those take up a lot of tokens (and are very useful in production environments).
I'm no expert, so I likely won't be able to assist with a PR on this :'( But looking at the discussion around the initial llama.cpp integration, I have a couple of observations:
The default behaviour of a --cache-prompt flag could be to save the KV cache to the model folder, as is done with trained adapters; it would then be useful to have additional flags to clear or delete the cache afterwards.
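To make that suggestion concrete, here is a hypothetical sketch of the persistence side: write the computed state to a file in the model folder, keyed by a hash of the prompt, and reload it on the next run, with a companion helper showing what a clear/delete flag might do. Everything here is invented (`save_prompt_cache`, `load_prompt_cache`, `clear_prompt_caches`, the file naming, and the JSON layout); a real implementation would serialize actual KV tensors, not JSON.

```python
# Hypothetical sketch of --cache-prompt persistence; not the MLX_lm API.
import hashlib
import json
from pathlib import Path


def cache_path(model_dir, prompt):
    """One cache file per distinct prompt, named by a hash of the prompt."""
    digest = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    return Path(model_dir) / f"prompt_cache_{digest}.json"


def save_prompt_cache(model_dir, prompt, state):
    """Persist the (stand-in) computed state next to the model weights."""
    path = cache_path(model_dir, prompt)
    path.write_text(json.dumps({"prompt": prompt, "state": state}))
    return path


def load_prompt_cache(model_dir, prompt):
    """Return the saved state, or None on a cache miss (caller recomputes)."""
    path = cache_path(model_dir, prompt)
    if not path.exists():
        return None
    return json.loads(path.read_text())["state"]


def clear_prompt_caches(model_dir):
    """What a companion --clear-cache flag might do: delete all cache files."""
    removed = 0
    for f in Path(model_dir).glob("prompt_cache_*.json"):
        f.unlink()
        removed += 1
    return removed
```

Keeping the cache alongside the model (like adapters) means it is found automatically on the next run, while a separate clear flag keeps disk usage under the user's control.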