microsoft / LLMLingua

To speed up LLM inference and enhance LLMs' perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.42k stars 241 forks

Support for llama.cpp or exl2 #41

Open TechnotechGit opened 8 months ago

TechnotechGit commented 8 months ago

Hi, this is an interesting project. I would like to use this with llama.cpp (llama-cpp-python more specifically), but when I had a look at the code I wasn't able to switch out the model loaders (I got stuck on the attention mask). Are you planning on officially integrating more model loaders/formats?

iofu728 commented 8 months ago

Hi @TechnotechGit,

Thank you for your support of LLMLingua. As mentioned in issue #40, I don't believe there are any significant obstacles to supporting exl2. However, I currently don't have the bandwidth to work on it myself, so a contribution would be very welcome if you are interested.

TechnotechGit commented 8 months ago

I've been working on trying to support llama.cpp via llama-cpp-python; I will see if I can make a PR soon (still working through it; attention masks are unsolved, as llama.cpp has them but llama-cpp-python does not seem to bind them).

Regarding the attention masks, they seem to be important here, but when debugging I only ever saw lists full of 1s, never any 0s. Are they only used when submitting a list to the context property for batch processing? If so, would it be worth implementing this in a later PR?
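
For context, a minimal sketch (outside LLMLingua; the model and prompt strings are arbitrary) of where the 0s come from: with a Hugging Face tokenizer the attention mask is all 1s for a single unpadded sequence, and 0s only appear once shorter sequences in a batch are padded.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 defines no pad token by default

single = tokenizer("compress this prompt", return_tensors="pt")
print(single["attention_mask"])  # all 1s: a single, unpadded sequence

batch = tokenizer(
    ["compress this prompt", "a much longer prompt that forces padding of the first one"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # 0s appear only where the shorter prompt was padded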

iofu728 commented 8 months ago

Hi @TechnotechGit,

Thank you for your effort. To my recollection, the attention mask indeed hasn't been utilized, and I think it could be implemented later on. Once again, I appreciate your help.

TechnotechGit commented 7 months ago

@iofu728 Just to update, the latest bug is with the logits. I'm not very experienced with low-level PyTorch, but my guess is that this line is meant to focus on the last 203 tokens? When I run it, I get the following from some print statements I added:

Len input ids: 240
Len attention mask: 240
Len input ids: 203
Len attention mask: 203
Traceback (most recent call last):
  File "c:\...\Documents\LLMLingua\run.py", line 48, in <module>
    compressed_prompt = llm_lingua.compress_prompt(
  File "c:\...\Documents\LLMLingua\llmlingua\prompt_compressor.py", line 309, in compress_prompt
    context = self.iterative_compress_prompt(
  File "c:\...\Documents\LLMLingua\llmlingua\prompt_compressor.py", line 797, in iterative_compress_prompt
    loss, past_key_values = self.get_ppl(
  File "c:\...\Documents\LLMLingua\llmlingua\prompt_compressor.py", line 174, in get_ppl
    active_logits = shift_logits.view(-1, shift_logits.size(-1))[active]
IndexError: The shape of the mask [202] at index 0 does not match the shape of the indexed tensor [442, 50257] at index 0    
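
For reference, a minimal standalone sketch of the shift-logits/shift-labels pattern the traceback points at (this is not the actual get_ppl code; the shapes are illustrative, chosen to mirror the numbers above): the boolean mask used to index the flattened shifted logits must have exactly as many entries as the logits have rows.

import torch

seq_len, vocab = 443, 50257                       # 240 + 203 positions, as printed above
logits = torch.randn(1, seq_len, vocab)
labels = torch.randint(0, vocab, (1, seq_len))
attention_mask = torch.ones(1, seq_len)

# Predict token t+1 from position t: drop the last logit and the first label.
shift_logits = logits[..., :-1, :].contiguous()   # (1, 442, vocab)
shift_labels = labels[..., 1:].contiguous()       # (1, 442)
active = (attention_mask[..., 1:] == 1).view(-1)  # must also have 442 entries

active_logits = shift_logits.view(-1, vocab)[active]  # works: 442 mask entries, 442 rows
# If `active` is built only from the new 203-token segment (202 entries after the
# shift) while the logits still cover all positions, the IndexError above is raised.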

On a side note, I hope that the following lines

# Drop positions [s, s + e) from each layer's keys and values along the
# sequence dimension, shortening the KV cache.
past_key_values = [
    [
        torch.cat([k[..., :s, :], k[..., s + e :, :]], dim=-2),
        torch.cat([v[..., :s, :], v[..., s + e :, :]], dim=-2),
    ]
    for k, v in past_key_values
]

are not important, as llama-cpp-python seems to return the cache as bytes rather than an 'exposed' KV cache, so if this part is vital, more work will be needed there.

iofu728 commented 6 months ago

Hi @TechnotechGit, I'm deeply sorry for missing your message. Thank you very much for your assistance.

The first issue arises because LLMLingua utilizes a KV Cache to avoid recalculating segments that have already been computed. I suspect the mismatch might be due to not removing the segments that have already been calculated. If you can share more of your code, we could work through the issue together.
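
As an illustration (not LLMLingua's exact code, and assuming a standard Hugging Face causal LM such as gpt2), reusing a KV cache means the second forward pass is fed only the new tokens, while the attention mask covers past plus new positions; the returned logits then cover only the new segment, which is the length the loss mask has to match.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("first segment of a long prompt", return_tensors="pt").input_ids
out = model(ids, use_cache=True)
past = out.past_key_values                       # cache for the already-computed segment

new_ids = tokenizer(" second segment", return_tensors="pt").input_ids
mask = torch.ones(1, ids.size(1) + new_ids.size(1), dtype=torch.long)  # past + new positions
out2 = model(new_ids, attention_mask=mask, past_key_values=past, use_cache=True)

print(out2.logits.shape)  # (1, new_ids.size(1), vocab): logits only for the new segment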

The second issue can be disregarded. It's merely to reduce the length of the KV cache to support longer prompts.