microsoft / LLMLingua

[EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

How to set up LLMLingua with localhost? #47

Open JiHa-Kim opened 9 months ago

JiHa-Kim commented 9 months ago

Hello, how do I set up LLMLingua with a self-hosted localhost server? Is there a tutorial? Thanks.

iofu728 commented 9 months ago

Hi @JiHa-Kim,

Thank you for your support. I suggest referring to the code of the Hugging Face Space demo. Based on that, you can build a self-hosted local server using Gradio, along the lines of the sketch below.
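Here is a minimal sketch of such a Gradio server wrapping LLMLingua, loosely modeled on the Space demo; the model, port, and parameter choices are assumptions, not values taken from the demo:

```python
# Minimal self-hosted Gradio front-end for LLMLingua (sketch, not the official demo code).
import gradio as gr
from llmlingua import PromptCompressor

# Loads the default compression model; pass device_map="cpu" if no GPU is available.
llm_lingua = PromptCompressor()

def compress(prompt: str, target_token: float) -> str:
    # compress_prompt returns a dict with the compressed prompt and compression stats.
    result = llm_lingua.compress_prompt(prompt, target_token=int(target_token))
    return result["compressed_prompt"]

demo = gr.Interface(
    fn=compress,
    inputs=[gr.Textbox(lines=10, label="Prompt"), gr.Number(value=200, label="Target tokens")],
    outputs=gr.Textbox(label="Compressed prompt"),
)

if __name__ == "__main__":
    # server_name="0.0.0.0" exposes the app on localhost and the local network.
    demo.launch(server_name="0.0.0.0", server_port=7860)
```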

JiHa-Kim commented 9 months ago

How do you use the GGUF format instead of GPTQ? Can you use LM Studio to host it? It would be great to run inference split across CPU and GPU.

Also, how do you get it to work with an AI API endpoint? I keep getting the error:

    compressed_prompt = llm_lingua.compress_prompt(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in compress_prompt
    context_tokens_length = [self.get_token_length(c) for c in context]
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in <listcomp>
    context_tokens_length = [self.get_token_length(c) for c in context]
                             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 254, in get_token_length
    self.tokenizer(text, add_special_tokens=add_special_tokens).input_ids
    ^^^^^^^^^^^^^^
AttributeError: 'OpenRouterPromptCompressor' object has no attribute 'tokenizer'

You can look at the code I tried to use in my GitHub repository...

iofu728 commented 9 months ago

Hi @JiHa-Kim, thank you for your help and efforts.

I haven't tried using GGUF with LLMLingua yet, but I believe there shouldn't be any major blocking issues. Also, a special thanks to @TechnotechGit, who is currently assisting in making llama.cpp compatible with LLMLingua. I'm confident this will facilitate support for models in GGUF format.

Regarding the second issue, it seems to stem from the lack of a defined tokenizer in OpenRouterPromptCompressor. You might try initializing a tokenizer using tiktoken. However, I suspect there might be some additional errors to address later on.
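For example, a rough sketch of what that could look like; the class and method names simply mirror the traceback above and are otherwise assumptions about your code:

```python
# Sketch: give the custom compressor a tokenizer so token-length counting works.
import tiktoken

class OpenRouterPromptCompressor:
    def __init__(self):
        # cl100k_base is used only as a stand-in; counts won't exactly match
        # the remote model's tokenizer.
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def get_token_length(self, text: str, add_special_tokens: bool = True) -> int:
        # tiktoken has no special-tokens flag, so the argument is ignored here.
        return len(self.tokenizer.encode(text))
```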

JiHa-Kim commented 9 months ago

Thanks, I am excited to get this working properly.

JiHa-Kim commented 9 months ago

Well, it seems like I managed to get the model loaded using llama-cpp-python with the new code in my repository, but now I hit this error and I am stuck.

Traceback (most recent call last):
  File "C:\Users\Public\Coding\LLMLingua\LLMLingua_test1.py", line 48, in <module>
    compressed_prompt = llm_lingua.compress_prompt(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 230, in compress_prompt
    context = self.iterative_compress_prompt(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 712, in iterative_compress_prompt
    loss, past_key_values = self.get_ppl(
                            ^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 83, in get_ppl
    response = self.model(
               ^^^^^^^^^^^
TypeError: Llama.__call__() got an unexpected keyword argument 'attention_mask'

iofu728 commented 9 months ago

Hi @JiHa-Kim, currently, calling the llama.cpp model may not be supported, or it might require modifying how the model is called in PromptCompressor.
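As a rough illustration of that kind of change (names and structure are assumptions, and this alone would not be a complete fix, since LLMLingua also expects transformers-style outputs with .logits and .past_key_values):

```python
# Sketch: drop keyword arguments that llama_cpp.Llama.__call__ does not accept
# (e.g. attention_mask) before forwarding the call.
from llama_cpp import Llama

class LlamaCppCallAdapter:
    UNSUPPORTED_KWARGS = {"attention_mask", "past_key_values", "use_cache"}

    def __init__(self, model_path: str):
        self.model = Llama(model_path=model_path, logits_all=True)

    def __call__(self, prompt, **kwargs):
        # Silently discard arguments the llama.cpp binding does not understand.
        kwargs = {k: v for k, v in kwargs.items() if k not in self.UNSUPPORTED_KWARGS}
        return self.model(prompt, **kwargs)
```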