microsoft / LLMLingua

To speed up LLM inference and enhance LLMs' perception of key information, LLMLingua compresses the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/

autogen compressible agent integration #28

Open · yenif opened 8 months ago

yenif commented 8 months ago

Hello!

I put LLMLingua into AutoGen as part of a compressible agent: https://github.com/microsoft/autogen/pull/1005

It's basically functional, but too slow on my MacBook with Llama 2 to really test.

I figured I'd try phi-2, but it didn't return past_key_values. I have no clue whether that's a dead end or fixable :-)

I'd appreciate any input on using LLMLingua effectively as the compressor for GPT agents.
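
For context, the compression call itself is straightforward; this is roughly the shape of it (a minimal sketch following the `PromptCompressor` API from this repo's README; the prompt text and `target_token` value are just placeholders):

```python
from llmlingua import PromptCompressor

# Defaults to a Llama-2-7B checkpoint as the small compressor model;
# this is what's slow on my MacBook, hence the interest in phi-2.
llm_lingua = PromptCompressor()

long_prompt = "...full agent context / chat history to compress..."

result = llm_lingua.compress_prompt(
    long_prompt,
    instruction="",    # optional task instruction to keep intact
    question="",       # optional question to keep intact
    target_token=200,  # rough token budget for the compressed prompt
)
print(result["compressed_prompt"])
```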

Thanks!

iofu728 commented 8 months ago

Hello @yenif,

Thank you for your help and support. We agree that agent scenarios like AutoGen are well-suited for approaches like LLMLingua to reduce token redundancy. However, there might be new issues that need to be addressed.

The reason phi-x models cannot be invoked directly is that their modeling code does not build on HuggingFace's standard causal-LM classes (i.e., what `AutoModelForCausalLM` loads; see https://huggingface.co/microsoft/phi-2/blob/main/modeling_phi.py#L960), which leaves LLMLingua unable to access the kv-cache. One solution could be to rewrite the phi-x code within that standard framework.
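
As a quick way to see the problem, you can check whether a model's forward pass exposes the kv-cache at all (a minimal sketch, assuming the standard transformers forward signature; with phi-2's current custom code this check is expected to fail, which is exactly the issue above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# phi-2's modeling code is custom, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)

inputs = tokenizer("Hello, world.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, use_cache=True)

# LLMLingua relies on past_key_values to reuse computation across its
# iterative compression steps; a model whose forward pass does not
# populate it cannot serve as the compressor as-is.
print(getattr(outputs, "past_key_values", None) is not None)
```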