microsoft / LLMLingua

[EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.53k stars 253 forks

How about compressing a whole book? #3

Open lucasjinreal opened 1 year ago

lucasjinreal commented 1 year ago

Will it still be able to summarize the book, or answer questions about important events in it?

iofu728 commented 1 year ago

Hi @lucasjinreal,

Thank you for your question. I believe that LLMLingua & LongLLMLingua can retain most of the information even when compressing very long prompts.

For example, let's do a simple experiment to compress the LongLLMLingua article. We copied the plain text format of the paper from https://www.arxiv-vanity.com/papers/2310.06839/. The original prompt had 16k tokens, and we compressed it to around 2.5k tokens, achieving a compression rate of almost 6x. We obtained the prompt below.

Some studies that the performanceMs depends position key informationquestion) in the prompt.\nWe on a range long including-multi-document Q,-shot, summar synthetic, code.\nThe experimental results show LongMLingua compressed prompt can derive higher with cost.For example, onQuestions benchmarkLLMLingua up to over with ∼\n4 fewer tokens as input to-Turbo.\nAdditionally, compressing prompts ∼\n10 tokens at rate of 2-10 LongLLMLingua speed-to-end by 1.4x-8x.11 code https://msingua.Care designing achieve better performance in specific downstream tasks.Scenarios such as multi-document question, summarization also necessitate the contexts. vDocument Number\n() Performance v.s.DocumentPerformance v.s.\nKey Information Position\n(b) Performance v.s.\nIn case, \n most documentss based on ground or LongLLing r\nk\n.\n implies noise introduced into prompt. improve the density in the prompt we present questionaware-to- compression.\n(b) LL’ ability the information depends on positions in the prompt. reduce information middle, a document mechanism.\nHowever, the the distribution of-relevant key information in the prompt is generally sparse. compression likeinguaJiang et al.,23 and Selective (, 2023) do not content of the may noisy information in compressed results leading to inferior.In this paper LongLLMLingua designed to LL’s perception of key informationrelevant the) in the prompt, that third of inferior performance long scenarios could be addressed.Figureb)b is an example.1);2) We introduce a document reordering mechanism reduce information loss in the middle.\nExperimental results demonstrate that compared original LongLLMLingua with much costs.\nPROBLEM FORMULinguaJiang et al., 2023), use to prompt composed of the x K documents \nx\ndoc\nk, and the \nx\nque.\nIn fact be modified according to application scenarios.\nFor example xins at beginning can removed, \nxque can be any specified \n(xdoc1,\n�,xdoc\nK\n) can additional materials that users prompt to get a better response from LLMs xque.\ny represents the ground- output texts \nx as input ˜\ny represent theM-generated results \nx\n.\n the distribution \ny\n \n˜\ny to be as similar as possible.In work, we additionally operation space permutation over \nK documents \n(xdoc1\n,\n⋯,\nx\ndoc\nK\n) for joint optimization.\nRELIMINUA\ningua (Jiang et al 2023 uses small language model \nM\nS\n calculate the perplexity each token in the prompt tokensities. behind approach is tokens with lower perplexities contribute less to overall gain of language, so removing will have relatively minor impact on theM’s comprehension of context.\nMLua consists three a budget controller iterative token compression algorithm, and a distribution mechanism, shownic texts. 
intermediate results are divided segments and the token- compression is performed segment by segment, the perplexity of each token conditioned on segments \n.\nFor alignment, performs instruction tuning \nS with the data the target LLM to narrow the distribution of LLM that \nM\nS\n used compression.\n4ONGLLMLINGUA\nFram \n: FrameworkLLing.\nGray Ital content: in LLingua.\n primary challenge scenarios is how to enhance LL’s perception of key information to the the prompt.LLMLing addresses from three perspectives and further subsequ strategy the accuracy and reliability information provided to users.We each component in section.1HOW IMPROVE INFORMATION DENSITY IN THEOM?\n- Compression\nIn-gr compression, we aim to figure out a metric \n\n evaluate the importance each xdoc\nk\n{\nxdoc\nk,\ni\n}\nN\nk\ni\n=1, where \nN\nk is the number of tokens \nx\ndoc\nk\n.\nWe \nx\ndoc\nk\n with higher r\nk\n as the results.\nLLMLingua uses document-level perplexity to represent importance of documents: r\nk\n=\n\n/N\nk\n�N\nki\np\n\nxdoc\nk,\ni\n)\nlog\np\n(\nxdoc\nk\n,\ni\n),\nk\n�\n{\n1,\n2,\n⋯,\nK}.\n retained documents typically contain lot of information, they are irrelevant to the question \nx\nque and instead noise, reducing key density in the compressed results and bringing for LLM answers. shown in Figurea) the@ of LLMLing only reaches indicating its in retaining key during compression.\nRetrieval- methods also feasible here.\nWe \nxque retrieve the most among \n(\nx\ndoc\n1\n,\n�\n,\nx\ndocK) as the compressed results.\nHowever, these struggle distinguish questionrelated fine-gr information with key information may be discarded during retrieval\n shown in Figure (a) embedding methods such asence BERT and Open Embedd only achieve ∼\n75% accuracy in recall5 which implies the accuracy upper bound ofMs 4x is only 75\nRecall\n(a) Recall DistributionPerplexity Distribution(b) Perplexity Distribution\nFigure: (a) Comparison on MultidocumemntA dataset.\n(b) Comparisonities contrastiveities of in the prompt from Multi-documemnt QA dataset.The with the ground located on the of dashed.\nOne approach to improve key information the to calculate document-level perplexity conditioned on question \n\nque\n.\n this not be effective because documents often contain significant amount of irrelevant information.\nEven when conditioned on \nque\n, perplex scores computed for entire not be sufficiently distinct, them an inadequate metric for document- compression.\n, we to use perplexity of the question \nx\nque conditioned on different contexts doc\n\n to represent the association them.\nWe append a statement \nxrest\n2\n, “We can the answer to question given”.after x\nque\n to the inter of \nx\nque\n and \nx\n.\nIt regarded regular that impact of hallucinations.\n can be:\n\n\n=\n\nNNc\n∑\ni\n p\n\nxque\n,restrict\ni\n\nx\ndoc\nk\n)p(\nx\nque,\nrestricti\n|\nx\ndoc\nk\n),\nk\n∈{\n,\n2\n,\n�\n,\nK\n}\n,()\nwhere \nque,restrict\ni\n the \ni\n-th token in the concaten of xque \nxrestrict\n N\nc\n in number of tokens.Figure (aa that our coarse-level compression approach the highest recall with of retained, suggesting preserves information from the documents \n(\n\ndoc\n1,\n⋯\n,\nxdoc\nK\n) the.\nQuestion-A FineGrression fine-grained compression, assess the importance each token in the instruction \nx\n\n, question \n\n\n, and \nK\n documents {\nx\ndoci\n}\nK\ni\n=\n retained after coarse-grained compression.\nWe incorporate the iterative compression mechanism following LLMLua andities to \n and \n\n\n.In this section, 
we investigate how to the fine-grained-level {\nx\ndoc\nk}\nK′\nk\n=\n1 the question\nxque the results could contain more question- information.A solution for the awareness of\nxque concaten beginning the context.However result low in context, from tokens.\nIn this paper, propose contrastive perplexity, i.e., the shift caused condition the question the association between and the question\n can seeities are widely distributed in documents\nHowever withive perplex concentrate on left side of the line, corresponds to document that contains the answer to the question\nThis suggests that contrastive perplexity better distinguish, thus key the results.\n4.HOW TO INFORMATION L IN THE MIDDLE?\nAfter the coarse-grained, have set of documents \n\nxdoc\n}\n′k\n=\n1\n scores \nk\n}\nK′k=\n their association with the question \nxque\n.Therefore, weorder importance scores to leverage’:x\n,\nx\ndoc\n1\n,\n�\n,\nx\n%doc\nK′\n,\nxque\n)\nr\nk�\n(\nxins\n,\nxdoc\n\n�,\n\n\n\n′,\nque)\n(4)\n4.HOW TO A ADIVE GRULAR CONTROL DURINGPRESSION? fineained, LLing applies save compression ratio all documents from-gr.However the information density of different documents different.\n relevant to the, more budget (i.e., lower ratio) we to it.\nTherefore, bridge-gr compression to fine-grained compression and use\nr\nk}\nK′\nk=1\n obtained from coarse-gr compression to guide allocation in fine-gr compression. way, we achieve adaptive gran control the.\n, we first initial budget retained documents\ndoc 3\n3 LLMLing it is\nτ\nems demonstrations.\n the controller of LLML.\n-grained, follow the iter tokenlevel the compression budget τdoc\n\n to each \ndoc\nk according to \n\nk\n)e.g., 0, the-gr compression. this paper, employ a sched the adaptive allocation.Budget of token \nx\ni can be formulated as:\n\nτ\ni\n=\nτ\ndoc\nk\n\nx\ni\n�x\ndoc\n5\ndoc\n\n\n\n\n\n2\nr\nk))\nδ\nτ\ndoc,\n0),\n1\n),where \nN\nd number of \nδ\nτ\n is a-param controls overall allocation.4HOW IMPROVE INTEGITY INFORMATION?For example entity “� in the might be compressed to “209� the name “Wilhelm Conrad Rönt” might be compressed to �Wilgen”.\nThis problems fact tasks like QA, where language models tend to replicate shown Figure 4.\nThe of Recovery, the red text the original text, and the blue text result after the L 2-izer.Figure example of Subsequence Recovery, the red text represents the original text, and the blue text is the using the LLa 2-B tokenizer.\nThis relies on the subsequ in the prompt,,� response.\n..\n\n\nr}\n appears in the \nx\n.\n.\n.,xj} in the original\nx corresponding to˜\nykey,\nl in the originalcelerated using or sequence autom).\n) Replace matched tokens \n˜\ny\n,\n LL� response subsequence \nx\ni, the.\n please refer to Algorithm.\n(?\n\nIn this, use G-3.134\n\nFor exceeding GPT-3.5-Turbo-1613.\n stable reproducible results, we employ greedy decoding temperature to 0 all experiments.\nFor the small models compression, we LMA--7B77\nhttps://ai.com which has aligned by supervised finetuningHF.\nWe up hyperparameters following LLMLingua size usedative set 200.\n details provided Appendix.ric\n use multi-documentA and useBench andSCRO for scenarios.i) Natural (L et al., 20): similar to theaugment commercial question-answ like Chat.\nSpecifically each question related the prompt.One of them contains correct answer there are five ground truth in:, 5th, 10th, 15th, and 20th.(2023), accuracy evaluation.\n) Long (Bai et al 2023): consists of task:-document multidocument Q,shot, synthetic.\n the English portion evaluation.\nWe metrics and 
scripts along benchmark evaluation.) ZeroSC (Shah., 20): consists task Q, and, covering datasets.\nWe the validation set evaluation. use the provided metrics and for evaluation.\n include two sets of baselines followingi) Retri-based Methods.\n the association the question and the in usingTA retrieval BM25, Giang et al 2023 Sent (Reimers and Gvych, 2019), Open Embedding, and the important metric \nr\nk\n used in LongLLMLingua coarse-grained compression.\nWe discard sentences or paragraphs with low association until the compression constraint is met while keeping the original document order unchanged.\n(ii) Compression-based Methods.\nIn LLMLingua, a coarse-to-fine approach is used to handle constraints of compression ratio: the original prompt is first compressed to \nk\n times the constraint at a coarse level, where \nk\n is the granular control coefficient; token-level is then performed to reach the overall constraint.\nOur method follows the same coarse-to-fine logic to achieve the constraint.\n{tcolorbox}\nCompressed Prompt:\nPlease the of the question.
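
For reference, a compression along these lines can be reproduced locally. The following is a minimal sketch, assuming the `PromptCompressor` API and the LongLLMLingua-related options documented in this repository's README; `paper_text` is a placeholder for the plain text copied from arxiv-vanity:

```python
from llmlingua import PromptCompressor

# Uses a small causal LM (LLaMA-2-7B class by default) to score token perplexities.
llm_lingua = PromptCompressor()

compressed = llm_lingua.compress_prompt(
    paper_text,                             # placeholder: ~16k-token plain text of the paper
    question="Summarize the paper.",        # question-aware (LongLLMLingua) compression
    target_token=2500,                      # aim for ~2.5k tokens, i.e. roughly 6x compression
    rank_method="longllmlingua",            # question-conditioned coarse-grained ranking
    condition_compare=True,                 # contrastive perplexity for fine-grained compression
    condition_in_question="after",
    reorder_context="sort",                 # reorder documents to reduce loss in the middle
    dynamic_context_compression_ratio=0.3,  # adaptive granular control across documents
)

print(compressed["compressed_prompt"])
print(compressed["origin_tokens"], "->", compressed["compressed_tokens"])
```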

Now let's ask some questions, such as having GPT-4 help us answer the following questions:

  1. Summarize the paper,
    
    This paper investigates the effect of prompt compression on the performance of large language models (LLMs) in various tasks, such as multi-document question-answering, summarization, and code generation. The authors propose LongLLMLingua, a method designed to improve LLM's perception of key information in compressed prompts.

LongLLMLingua addresses the challenge of enhancing LLM's perception of key information through three main components: improving information density in the prompt, reducing information loss in the middle, and achieving adaptive granular control during compression.

To improve information density, LongLLMLingua uses a question-aware coarse-to-fine compression approach. It employs document-level perplexity conditioned on the question to represent the association between documents and the question. The method then incorporates an iterative token compression mechanism to assess the importance of each token.

To reduce information loss in the middle, LongLLMLingua introduces a document reordering mechanism based on the importance scores of documents, which helps leverage LLM's ability to perceive key information.

To achieve adaptive granular control during compression, LongLLMLingua bridges coarse-grained and fine-grained compression by using the obtained importance scores to guide budget allocation.

Experimental results show that LongLLMLingua can achieve higher performance with lower cost on various benchmarks. The proposed method outperforms both retrieval-based and compression-based baselines in multi-document question-answering, summarization, and code generation tasks.

2. Summarize the experimental results

This paper investigates the effect of prompt compression on the performance of large language models (LLMs) in various tasks, such as multi-document question-answering, summarization, and code generation. The authors propose LongLLMLingua, a method designed to improve LLM's perception of key information in compressed prompts.

LongLLMLingua addresses the challenge of enhancing LLM's perception of key information through three main components: improving information density in the prompt, reducing information loss in the middle, and achieving adaptive granular control during compression.

To improve information density, LongLLMLingua uses a question-aware coarse-to-fine compression approach. It employs document-level perplexity conditioned on the question to represent the association between documents and the question. The method then incorporates an iterative token compression mechanism to assess the importance of each token.

To reduce information loss in the middle, LongLLMLingua introduces a document reordering mechanism based on the importance scores of documents, which helps leverage LLM's ability to perceive key information.

To achieve adaptive granular control during compression, LongLLMLingua bridges coarse-grained and fine-grained compression by using the obtained importance scores to guide budget allocation.

Experimental results show that LongLLMLingua can achieve higher performance with lower cost on various benchmarks. The proposed method outperforms both retrieval-based and compression-based baselines in multi-document question-answering, summarization, and code generation tasks.

3. Summarize the motivation and insights

The paper focuses on enhancing a language model's (LL) perception of key information in the context of long scenarios, such as multi-document question-answering and summarization. The main contributions are:

  1. Introducing a question-aware coarse-to-fine compression approach to improve information density in the prompt.
  2. Proposing a document reordering mechanism to reduce information loss in the middle of the prompt.
  3. Developing an adaptive granular control method for compression.
  4. Improving the integrity of the information in the compressed prompt.

The experimental results show that the proposed method, LongLLMLingua, outperforms existing compression and retrieval-based methods in various tasks and datasets. It demonstrates higher performance with lower input tokens and faster end-to-end speed, making it suitable for specific downstream tasks like multi-document question-answering and summarization.



From these results, we can see that although some details may be inaccurate, the compressed prompt still manages to retain the majority of the essential information. Furthermore, if the compression ratio is too high or the number of retained tokens is too low, even using LongLLMLingua could impact the linguistic integrity of the prompt, consequently affecting downstream performance.

I hope my response was helpful in answering your question. If you have any further inquiries or need assistance, please don't hesitate to ask.
Avkashhirpara commented 8 months ago

Which underlying model supports input this big? I am using microsoft phi-2 and it only compresses prompts shorter than 2048 tokens: "Token indices sequence length is longer than the specified maximum sequence length for this model (9726 > 2048). Running this sequence through the model will result in indexing errors"

iofu728 commented 8 months ago

Hi @Avkashhirpara, you can disregard this warning. In LLMLingua, we process the data segment by segment and compress the KV cache that exceeds the context window limits.
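
For context, the warning is emitted by the HuggingFace tokenizer of the small compression model when it sees the full prompt at once, not by LLMLingua's compression logic. A rough sketch of the kind of call that triggers it, assuming the `model_name` argument of `PromptCompressor` and using `long_prompt` as a placeholder for the ~9.7k-token input:

```python
from llmlingua import PromptCompressor

# phi-2's context window is 2048 tokens, but LLMLingua splits the prompt into
# segments internally, so much longer inputs can still be compressed.
llm_lingua = PromptCompressor(model_name="microsoft/phi-2")

compressed = llm_lingua.compress_prompt(
    long_prompt,        # placeholder: the ~9.7k-token prompt that produced the warning
    target_token=2000,
)
# The "Token indices sequence length is longer than ..." message can be ignored;
# compression still proceeds segment by segment.
```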

vanillacandy commented 6 months ago

@iofu728 I attempted to test an article located at https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A32019L0001 using a demo hosted on Hugging Face, specifically at https://huggingface.co/spaces/microsoft/llmlingua-2. However, I encountered difficulties when attempting to compress a larger article of approximately 24711 tokens. Despite the expected latency with the Hugging Face demo, which processes data segment by segment, I still expected to receive some form of output. Unfortunately, I did not receive any results. Do you have any suggestions on how to compress such a lengthy article effectively?
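
For an input of this length, running LLMLingua-2 locally rather than through the hosted Space may be more practical. A minimal sketch following the usage shown in the repository README; `article_text` is a placeholder for the ~24k-token document:

```python
from llmlingua import PromptCompressor

# LLMLingua-2 uses a token-classification model rather than a causal LM.
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

compressed = llm_lingua.compress_prompt(
    article_text,                   # placeholder: the ~24k-token EUR-Lex text
    rate=0.33,                      # keep roughly a third of the tokens
    force_tokens=["\n", ".", "?"],  # tokens to always preserve
)
print(compressed["compressed_prompt"])
```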

vanillacandy commented 6 months ago

Never mind, I managed to generate a compressed output containing 17,361 tokens.