Feature Description
Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making them expensive to deploy in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for fine-tuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a 93x reduction in compute time while improving accuracy by 43% on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
https://arxiv.org/abs/2404.06910 https://github.com/apple/ml-superposition-prompting?tab=readme-ov-file
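To make the parallel-path idea concrete, here is a minimal conceptual sketch in Python. It is not the authors' implementation (see the linked repository for that); lm_score and lm_generate are hypothetical stand-ins for real LLM calls, and the relevance score below is a toy lexical-overlap proxy rather than a model-derived one.

# Conceptual sketch of superposition prompting, assuming hypothetical lm_score /
# lm_generate helpers in place of real LLM calls. Each retrieved document gets its
# own "prompt path" sharing the same preamble and query; paths deemed irrelevant
# are discarded before the answer is generated from the survivors.
from dataclasses import dataclass

@dataclass
class PromptPath:
    preamble: str
    document: str
    query: str

def lm_score(path: PromptPath) -> float:
    # Hypothetical relevance score for a path; a toy lexical-overlap proxy stands in
    # for a model-derived score (e.g. query likelihood conditioned on the document).
    doc_words = set(path.document.lower().split())
    query_words = set(path.query.lower().split())
    return len(doc_words & query_words) / max(len(query_words), 1)

def lm_generate(context: str, query: str) -> str:
    # Hypothetical generation call; replace with an actual LLM invocation.
    return f"<answer to {query!r} conditioned on {len(context.split())} context words>"

def superposition_prompt(preamble: str, documents: list[str], query: str,
                         keep_top_k: int = 2) -> str:
    # 1. Fan out: one independent path per retrieved document (parallelizable,
    #    since paths do not attend to one another).
    paths = [PromptPath(preamble, doc, query) for doc in documents]
    # 2. Prune: score each path and keep only the most relevant ones.
    survivors = sorted(paths, key=lm_score, reverse=True)[:keep_top_k]
    # 3. Generate from the surviving paths only.
    merged_context = "\n\n".join(p.document for p in survivors)
    return lm_generate(merged_context, query)

if __name__ == "__main__":
    docs = ["Paris is the capital of France.",
            "The Nile flows through northeastern Africa.",
            "France borders Spain, Italy, and Germany."]
    print(superposition_prompt("Answer using the documents.", docs,
                               "What is the capital of France?"))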
Reason
Quadratic scaling of inference cost: The inference cost of LLMs scales quadratically with respect to sequence length. This makes deployment expensive for real-world text processing applications, especially those involving long contexts.
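As a back-of-the-envelope illustration of that scaling (toy numbers, not figures from the paper): self-attention scores every token against every other token, so this term grows with the square of the prompt length.

def attention_pair_count(num_tokens: int) -> int:
    # Self-attention compares every token with every token: n * n score entries.
    return num_tokens * num_tokens

# An 8x longer prompt (e.g. 4,096 -> 32,768 tokens) makes this term 64x larger.
print(attention_pair_count(4_096))   # 16777216
print(attention_pair_count(32_768))  # 1073741824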
The "distraction phenomenon": LLMs suffer from a problem where irrelevant context in the prompt degrades the output quality. This suggests that LLMs can be sensitive to noise or irrelevant information in the input, potentially leading to lower quality outputs.
Value of Feature
Advantages
Improved Efficiency: Demonstrates significant reduction in compute time across various question-answering benchmarks.
Enhanced Accuracy: Particularly effective when the retrieved context is large relative to the model's training context.
Versatility: Applicable to multiple pre-trained LLMs.
Case Study: NaturalQuestions-Open Dataset
Using the MPT-7B instruction-tuned model:
93x reduction in compute time
43% improvement in accuracy
Implications
This methodology addresses two critical challenges in LLM deployment:
The computational cost of processing long contexts
The negative impact of irrelevant information on output quality