To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
Describe the issue
Hi, I'm trying to make this run properly with GGUF models (i.e. CPU only) due to RAM restrictions. I'd like to use it as is, but I need to somehow push some code for using `llama-cpp` so I can load the model properly (otherwise it stops at the tokenizer).

Has anyone already done this? Is it planned to be supported? Or would anyone have advice on how to proceed?
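For illustration, here is a minimal sketch of the kind of workaround I have in mind: computing per-token surprisal with `llama-cpp-python` and dropping the most predictable tokens, which is the perplexity-based filtering idea this kind of prompt compression relies on. This is not this repo's actual API, just a sketch under my own assumptions; `MODEL_PATH`, the `compress` helper, and the `keep_ratio` parameter are all placeholders I made up.

```python
# Minimal sketch, NOT this repo's API: perplexity-based token filtering on top
# of llama-cpp-python, assuming a local GGUF file (path below is a placeholder).
import numpy as np
from llama_cpp import Llama

MODEL_PATH = "models/llama-2-7b.Q4_K_M.gguf"  # assumption: any small GGUF model

# logits_all=True makes llama-cpp keep logits for every position in llm.scores
llm = Llama(model_path=MODEL_PATH, n_ctx=2048, logits_all=True, verbose=False)

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    """Keep only the tokens the model finds hardest to predict."""
    tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=True)
    llm.reset()
    llm.eval(tokens)

    # scores[i] holds the logits predicting token i+1, so drop the last row
    logits = np.asarray(llm.scores[: len(tokens) - 1], dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # surprisal (negative log-likelihood) of each actual next token
    nll = -logprobs[np.arange(len(tokens) - 1), tokens[1:]]

    # keep BOS plus the keep_ratio fraction of highest-surprisal tokens
    k = max(1, int(keep_ratio * (len(tokens) - 1)))
    keep = set((np.argsort(nll)[-k:] + 1).tolist())  # +1: nll[i] scores tokens[i+1]
    keep.add(0)

    # re-emit kept tokens in their original order so the result stays readable
    kept = [t for i, t in enumerate(tokens) if i in keep]
    return llm.detokenize(kept).decode("utf-8", errors="ignore")

print(compress("The quick brown fox jumps over the lazy dog, as everyone knows."))
```

If something like this is the right direction, the remaining question is where in the tokenizer/model loading path such a `llama-cpp` backend would need to be plugged in.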