To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
Describe the issue
Hi, I'm trying to make this run properly with GGUF models (i.e. CPU only) due to RAM restrictions. I'd like to use it as is, but I need to somehow push some code for using `llama-cpp` so I can load the model properly (otherwise it stops at the tokenizer).

Has anyone already done this? Is it planned to be supported? Or would anyone have advice on how to proceed?
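For illustration, here is a minimal sketch of the kind of workaround I have in mind: computing per-token surprisal with `llama-cpp-python` and dropping the most predictable tokens, which is the perplexity-based filtering idea this kind of prompt compression relies on. This is not this repo's actual API, just a sketch under my own assumptions; `MODEL_PATH`, the `compress` helper, and the `keep_ratio` parameter are all placeholders I made up.

```python
# Minimal sketch, NOT this repo's API: perplexity-based token filtering on top
# of llama-cpp-python, assuming a local GGUF file (path below is a placeholder).
import numpy as np
from llama_cpp import Llama

MODEL_PATH = "models/llama-2-7b.Q4_K_M.gguf"  # assumption: any small GGUF model

# logits_all=True makes llama-cpp keep logits for every position in llm.scores
llm = Llama(model_path=MODEL_PATH, n_ctx=2048, logits_all=True, verbose=False)

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    """Keep only the tokens the model finds hardest to predict."""
    tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=True)
    llm.reset()
    llm.eval(tokens)

    # scores[i] holds the logits predicting token i+1, so drop the last row
    logits = np.asarray(llm.scores[: len(tokens) - 1], dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # surprisal (negative log-likelihood) of each actual next token
    nll = -logprobs[np.arange(len(tokens) - 1), tokens[1:]]

    # keep BOS plus the keep_ratio fraction of highest-surprisal tokens
    k = max(1, int(keep_ratio * (len(tokens) - 1)))
    keep = set((np.argsort(nll)[-k:] + 1).tolist())  # +1: nll[i] scores tokens[i+1]
    keep.add(0)

    # re-emit kept tokens in their original order so the result stays readable
    kept = [t for i, t in enumerate(tokens) if i in keep]
    return llm.detokenize(kept).decode("utf-8", errors="ignore")

print(compress("The quick brown fox jumps over the lazy dog, as everyone knows."))
```

If something like this is the right direction, the remaining question is where in the tokenizer/model loading path such a `llama-cpp` backend would need to be plugged in.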