microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.27k stars 228 forks

Question about LongLLMLingua token-level compression #73

Closed eunseongc closed 6 months ago

eunseongc commented 6 months ago

Thanks for sharing your interesting research. I'm reproducing LongLLMLingua, and I'd like to ask about token-level prompt compression. My understanding is that after sorting the 20 documents, contexts are pruned according to the context budget, and then token-level compression is applied with a dynamic ratio. However, for the last document, e.g., the 13th in the sorted order 5-16-10-13, token-level compression seems to start and then suddenly stop. I suspect a related cause: when I pass only one document, set sentence-level and context-level filtering to False, and compress, it sometimes performs no compression at all regardless of the target_token parameter. Is this intended?
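For context, my mental model of the dynamic-ratio step is something like the following toy sketch. This is my own illustration, not the repository's actual code; the function name `dynamic_ratios` and the linear schedule are assumptions, with `delta` standing in for the `dynamic_context_compression_ratio` parameter.

```python
# Hypothetical sketch (not LLMLingua's actual implementation): spread a
# dynamic compression ratio across documents sorted by relevance, so
# earlier (more relevant) documents keep more tokens.
def dynamic_ratios(n_docs: int, base_ratio: float, delta: float) -> list[float]:
    """Assign each sorted document a keep-ratio.

    The first document gets base_ratio + delta/2, the last gets
    base_ratio - delta/2, with a linear schedule in between, so the
    mean ratio stays at base_ratio.
    """
    if n_docs == 1:
        return [base_ratio]
    step = delta / (n_docs - 1)
    return [base_ratio + delta / 2 - i * step for i in range(n_docs)]

# Example: 4 documents, mean keep-ratio 0.5, dynamic spread 0.4.
ratios = dynamic_ratios(4, base_ratio=0.5, delta=0.4)
# The last (least relevant) document should be compressed hardest.
```

Under this model, the last document in the sorted order should receive the smallest keep-ratio, i.e., the heaviest compression, which is why the behavior above surprised me.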

Here's an example of the input and output.

Input command

compressed_prompt = llm_lingua.compress_prompt(
    demonstration_str.split("\n"),
    instruction=instruction,
    question=question,
    target_token=500,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    use_sentence_level_filter=False,
    context_budget="+100",
    dynamic_context_compression_ratio=0.4,
    reorder_context="sort")

Here, the demonstration, instruction, and question come from the first example of the test set, where the question is "who got the first nobel prize in physics."

And here is the compressed output I got.


Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [6](Title: Nobel Prize in Physics) rendered by the discovery of the remarkable rays (or x-rays). This award is administered by the Nobel Foundation and widely regarded as the most prestigious award that a scientist can receive in physics. It is presented in Stockholm at an annual ceremony on 10 December, the anniversary of Nobel's death. Through 2018, a total of 209 individuals have been awarded the prize. Only three women (1.4% of laureates) have won the Nobel Prize in Physics: Marie Curie in 1903, Maria Goeppert Mayer in 1963, and Donna Strickland in 2018. Alfred Nobel, in his last will and testament, stated that his

Document [16](Title in Physics) death (1833–1896' portrait also appears on the obverse of Peace Prize and the Medal for the Prize Economics slightly different design. The on the reverse of var according to awarding the prize. The sides of Nobel Prize medals Chemistry and the same Nature, as a Goddess, whose veil is held up by the Genius of Science. These medals the ones for Physiology/Medicine and Literature designed by Erik Lindberg in 1902 laure receive a dipl directly from the 1: ofates in) The Nobel Prize in in 1 to Wilhelm Rö Germany,5 SEK is 770 SEK December 207 John Bardeen la twicein95 9. Skłod-Curie won Priz for103 and chem11. William Bragg was, until0, the young the195 at . women won the prize: MariappertM (963 of 207, the 1 Prize) A group including writers, against, having. Some, including Burton Feldman, have criticised this prize because they consider Prudhomme a mediocre poet. Feldman's explanation is that most of the Academy members preferred Victorian literature and thus selected a Victorian poet. The first Physiology or Medicine Prize went to the German physiologist and microbiologist Emil von Behring. During the 1890s, von Behring developed an antitoxin to treat diphtheria, which until then was causing thousands of deaths each year. The first Nobel Peace Prize went to the Swiss

Question: who got the first nobel prize in physics Answer:


The second passage above (Document [16]) is the last surviving document, which is where I would expect the highest compression ratio to be applied.



Below is the second example when compressing only one document.

This is my input command.

compressed_prompt = llm_lingua.compress_prompt(
    demonstration_str.split("\n"),
    instruction=instruction,
    question=question,
    target_token=20,
    condition_compare=True,
    condition_in_question='after',
    use_sentence_level_filter=False,
    use_context_level_filter=False)

Output:

[Original prompt]
Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [1](Title: Philadelphia Eagles) The Philadelphia Eagles are a professional American football franchise based in Philadelphia, Pennsylvania. The Eagles compete in the National Football League (NFL) as a member club of the league's National Football Conference (NFC) East division. They are Super Bowl champions, having won Super Bowl LII, their fourth NFL title, after winning in 1948, 1949, and 1960.

Question: when is the last time the philadelphia won the superbowl
Answer:
############################################################################
[Compressed prompt]
Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [1](Title: Philadelphia Eagles) The Philadelphia Eagles are a professional American football franchise based in Philadelphia, Pennsylvania. The Eagles compete in the National Football League (NFL) as a member club of the league's National Football Conference (NFC) East division. They are Super Bowl champions, having won Super Bowl LII, their fourth NFL title, after winning in 1948, 1949, and 1960.

Question: when is the last time the philadelphia won the superbowl
Answer:

-----------
Statistics: {'origin_tokens': 127, 'compressed_tokens': 127, 'ratio': '1.0x', 'origin_tokens_context': 87, 'compressed_tokens_context': 87, 'ratio_context': '1.0x'}

In the second example, changing the target_token parameter yielded the same results.

iofu728 commented 6 months ago

Hi @eunseongc,

Thank you for your support and for raising a great question. Indeed, the current code tends to retain more content in the last segment, sometimes not compressing it at all. This is mainly due to the impact of the termination condition on other variables, as seen at https://github.com/microsoft/LLMLingua/blob/main/llmlingua/prompt_compressor.py#L760. For simplicity, and considering the minimal impact of the last segment, we have retained this implementation for now.

We plan to address this issue in the future. Thank you again for your support.

eunseongc commented 6 months ago

Thanks, that clears up my question.