[EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
Hi, team!
Thanks so much for this inspiring work! I want to confirm whether this is the correct way to compress only the documents in the RAG setting:
```python
from llmlingua import PromptCompressor
from nltk.tokenize import sent_tokenize

document = """
1972 | Apollo program: Eugene Cernan is the last person to walk on the Moon, after he and Harrison Schmitt complete the third and final extra-vehicular activity (EVA) of Apollo 17. \
This is currently the last manned mission to the Moon. ; December 15 ; The Commonwealth of Australia ordains equal pay for women. ; The United Nations Environment Programme is \
established as a specialized agency of the United Nations. ; December 16 ; The Constitution of Bangladesh comes into effect. ; The Portuguese army kills 400 Africans in Tete, Mozambique. \
; December 19 – Apollo program: Apollo 17 returns to Earth, concluding the program of lunar exploration. ; December 21
"""

llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(
    sent_tokenize(document),
    instruction="",
    question="",
    target_token=10,
)
```
Should I use `sent_tokenize` to get a `List[str]`? And how can I precisely control the size of the compressed documents? In the example above, I got 19 tokens even though I set `target_token=10`. Looking forward to your response!
Hi @Hannibal046, thank you for your interest in LLMLingua.
The best practice is to divide the context semantically, e.g. at the document level; however, using `sent_tokenize(document)` is also viable.
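To illustrate the semantic splitting suggested above: in a timeline-style document like the one in the question, the `" ; "` separator already marks event boundaries, so splitting on it keeps each event intact. This is a minimal pure-Python sketch (the shortened document and the choice of delimiter are assumptions based on the example in the question, not part of the LLMLingua API):

```python
# Timeline-style context where " ; " separates individual events/dates,
# shortened from the document in the question for illustration.
document = (
    "Apollo program: Eugene Cernan is the last person to walk on the Moon. ; "
    "December 15 ; "
    "The Commonwealth of Australia ordains equal pay for women. ; "
    "December 16 ; "
    "The Constitution of Bangladesh comes into effect."
)

# Split on the event delimiter instead of sentence boundaries, so each
# chunk is a semantically complete event; drop any empty fragments.
chunks = [c.strip() for c in document.split(" ; ") if c.strip()]
print(len(chunks))  # → 5
```

The resulting `chunks` list can then be passed to `compress_prompt` as the `List[str]` context, in place of `sent_tokenize(document)`.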
Setting the compression target to 10 will be quite challenging. Additionally, our current method cannot guarantee complete adherence to the compression target, but aims to approach it as closely as possible. You can adjust the coefficients to get closer to your target.
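Since the method only approaches the target rather than guaranteeing it, one practical workaround is to re-run compression with a proportionally smaller requested target until the measured size fits the budget. This is a hypothetical sketch: `compress_fn` is a stand-in for a call like `llm_lingua.compress_prompt`, assuming only that it returns a dict containing a `"compressed_tokens"` count, and `fake_compress` is a toy stub mimicking the 19-tokens-for-target-10 overshoot from the question:

```python
def compress_to_budget(compress_fn, docs, budget, max_tries=5):
    """Retry compression, shrinking the requested target until the
    measured token count fits within `budget` (or max_tries is hit)."""
    target = budget
    result = compress_fn(docs, target_token=target)
    for _ in range(max_tries):
        if result["compressed_tokens"] <= budget:
            break
        # Overshot: scale the requested target down proportionally.
        target = max(1, int(target * budget / result["compressed_tokens"]))
        result = compress_fn(docs, target_token=target)
    return result

def fake_compress(docs, target_token):
    # Toy stub: always returns ~1.9x the requested target, mimicking
    # the observed 19 tokens when target_token=10 was requested.
    return {"compressed_tokens": max(1, int(target_token * 1.9))}

result = compress_to_budget(fake_compress, [], budget=10)
print(result["compressed_tokens"])  # fits within the 10-token budget
```

With a real `PromptCompressor`, each retry costs another forward pass, so a small `max_tries` keeps the overhead bounded.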
For more principles, you can refer to this document.