Feature Description
Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making them expensive to deploy in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for fine-tuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a 93x reduction in compute time while improving accuracy by 43% on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
https://arxiv.org/abs/2404.06910 https://github.com/apple/ml-superposition-prompting?tab=readme-ov-file
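To make the parallel-path idea concrete, here is a minimal conceptual sketch in Python. It is not the authors' implementation (see the linked repository for that); lm_score and lm_generate are hypothetical stand-ins for real LLM calls, and the relevance score below is a toy lexical-overlap proxy rather than a model-derived one.

# Conceptual sketch of superposition prompting, assuming hypothetical lm_score /
# lm_generate helpers in place of real LLM calls. Each retrieved document gets its
# own "prompt path" sharing the same preamble and query; paths deemed irrelevant
# are discarded before the answer is generated from the survivors.
from dataclasses import dataclass

@dataclass
class PromptPath:
    preamble: str
    document: str
    query: str

def lm_score(path: PromptPath) -> float:
    # Hypothetical relevance score for a path; a toy lexical-overlap proxy stands in
    # for a model-derived score (e.g. query likelihood conditioned on the document).
    doc_words = set(path.document.lower().split())
    query_words = set(path.query.lower().split())
    return len(doc_words & query_words) / max(len(query_words), 1)

def lm_generate(context: str, query: str) -> str:
    # Hypothetical generation call; replace with an actual LLM invocation.
    return f"<answer to {query!r} conditioned on {len(context.split())} context words>"

def superposition_prompt(preamble: str, documents: list[str], query: str,
                         keep_top_k: int = 2) -> str:
    # 1. Fan out: one independent path per retrieved document (parallelizable,
    #    since paths do not attend to one another).
    paths = [PromptPath(preamble, doc, query) for doc in documents]
    # 2. Prune: score each path and keep only the most relevant ones.
    survivors = sorted(paths, key=lm_score, reverse=True)[:keep_top_k]
    # 3. Generate from the surviving paths only.
    merged_context = "\n\n".join(p.document for p in survivors)
    return lm_generate(merged_context, query)

if __name__ == "__main__":
    docs = ["Paris is the capital of France.",
            "The Nile flows through northeastern Africa.",
            "France borders Spain, Italy, and Germany."]
    print(superposition_prompt("Answer using the documents.", docs,
                               "What is the capital of France?"))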
Reason
Quadratic scaling of inference cost: The inference cost of LLMs scales quadratically with respect to sequence length. This makes deployment expensive for real-world text processing applications, especially those involving long contexts.
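As a back-of-the-envelope illustration of that scaling (toy numbers, not figures from the paper): self-attention scores every token against every other token, so this term grows with the square of the prompt length.

def attention_pair_count(num_tokens: int) -> int:
    # Self-attention compares every token with every token: n * n score entries.
    return num_tokens * num_tokens

# An 8x longer prompt (e.g. 4,096 -> 32,768 tokens) makes this term 64x larger.
print(attention_pair_count(4_096))   # 16777216
print(attention_pair_count(32_768))  # 1073741824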
The "distraction phenomenon": LLMs suffer from a problem where irrelevant context in the prompt degrades the output quality. This suggests that LLMs can be sensitive to noise or irrelevant information in the input, potentially leading to lower quality outputs.
Value of Feature
Advantages
Improved Efficiency: Demonstrates significant reduction in compute time across various question-answering benchmarks.
Enhanced Accuracy: Particularly effective when the retrieved context is large relative to the model's training context.
Versatility: Applicable to multiple pre-trained LLMs.
Case Study: NaturalQuestions-Open Dataset
Using the MPT-7B instruction-tuned model:
93x reduction in compute time
43% improvement in accuracy
Implications
This methodology addresses two critical challenges in LLM deployment:
The computational cost of processing long contexts
The negative impact of irrelevant information on output quality