microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: Running LLMLingua with GGUF models #100

Open 92dev opened 4 months ago

92dev commented 4 months ago

Describe the issue

Hi, I'm trying to get this running properly with GGUF models (i.e. CPU only) due to RAM restrictions. I'm trying to use it as:

compressor = PromptCompressor(
    device_map="cpu",
    model_name="TheBloke/Llama-2-7B-GGUF",
    model_config={ 'model_file': "llama-2-7b.Q4_K_M.gguf", 'model_type': "llama", 'gpu_layers': 0 }
)

but it seems I would need to add some code that uses llama-cpp so the model can be loaded properly (otherwise it stops at the tokenizer). Has anyone already done this? Is it planned to be supported? Or does anyone have advice on how to proceed?
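For context, the model_config keys above are the arguments the ctransformers loader takes, so what I'm effectively hoping for is that LLMLingua could perform something like the following load internally (just an illustrative sketch with the same file name, not something the library supports today as far as I can tell):

from ctransformers import AutoModelForCausalLM

# Sketch: load the quantized GGUF file on CPU only (gpu_layers=0) with
# ctransformers; these are the loader arguments the model_config keys map to.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",
    model_file="llama-2-7b.Q4_K_M.gguf",
    model_type="llama",
    gpu_layers=0,
)
print(llm("The capital of France is", max_new_tokens=8))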

iofu728 commented 4 months ago

Hi @92dev, currently @Technotech is helping make llama-cpp support LLMLingua. You can find more details at https://github.com/abetlen/llama-cpp-python/issues/1065 and in #41.
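In the meantime, for anyone who wants to experiment, here is a rough sketch (not LLMLingua API) of pulling token-level logprobs for a prompt out of a GGUF model with llama-cpp-python; per-token scores like these are the kind of signal the perplexity-based compression needs from the small model. The file name and parameters below are illustrative only.

from llama_cpp import Llama

# Load a local GGUF file on CPU; logits_all=True keeps logits for every
# position so the prompt tokens themselves can be scored.
llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=0,
    logits_all=True,
    verbose=False,
)

# echo=True returns the prompt tokens together with their logprobs.
out = llm(
    "The quick brown fox jumps over the lazy dog.",
    max_tokens=1,
    logprobs=1,
    echo=True,
)
lp = out["choices"][0]["logprobs"]
print(list(zip(lp["tokens"], lp["token_logprobs"])))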