run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
34.57k stars · 4.88k forks

[Question]: What is an appropriate text document format for RAG? #13240

Closed Ninlawat-Puhu closed 1 week ago

Ninlawat-Puhu commented 3 months ago

Question Validation

Question

I use plain text as the source for RAG. Could you recommend a text format that works well for RAG with LlamaIndex?

For example, in [1] I see a format of question–answer text pairs.

[1] https://github.com/reichenbch/RAG-examples/blob/main/dataset.txt#L212

dosubot[bot] commented 3 months ago

For formatting text documents for RAG using LlamaIndex, focusing on a text-pair question and answer format, consider these advanced practices:

  1. Separate Text for Retrieval and Synthesis: Use summaries or sentence embeddings for retrieval linked to detailed text chunks for synthesis, enhancing accuracy and relevance.
  2. Implement Structured Retrieval: Use metadata tagging or document hierarchies for efficient management of large document sets, improving search relevance.
  3. Adapt Retrieval Strategy by Task: Customize retrieval based on the query type, using different techniques for factual answers, summaries, or comparisons.
  4. Fine-tune Context Embeddings: Train or fine-tune embeddings on your corpus to capture data nuances, boosting retrieval performance.
  5. Use Advanced Query Engine Modules: Employ modules for complex queries, including structured data joining and query planning, to increase accuracy.
  6. Document and Evaluate: Maintain detailed documentation of your dataset and retrieval system, including structure, use cases, and performance benchmarks.
  7. Utilize Prompt Templates: Standardize input and output formats with prompt templates for consistency and efficiency.
  8. Handle Structured Data with Specialized Templates: For structured data, use templates designed for schema extraction, text to SQL conversion, and JSON path queries.

These practices, detailed in the LlamaIndex documentation and RAG dataset submission template, can significantly enhance the performance and scalability of your RAG application in production environments.
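As a concrete illustration of practices 1, 2, and 7, the sketch below (plain Python, no LlamaIndex dependency) parses a Q&A-pair text file like the one linked in [1] into records whose question serves as retrieval metadata while the answer is kept as the synthesis text, and defines a minimal prompt template. The `Question:`/`Answer:` markers, the helper names, and the template wording are assumptions for illustration; adapt them to your own corpus.

```python
# Sketch: turn Q&A-pair text into retrieval-ready records (practices 1, 2, 7).
# Assumes the source alternates "Question: ..." / "Answer: ..." lines;
# the markers are an assumption, not a required LlamaIndex format.

def parse_qa_pairs(raw: str) -> list[dict]:
    """Split Q&A-formatted text into records: the question doubles as
    retrieval metadata, the answer is the chunk used for synthesis."""
    records = []
    question = None
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("Question:"):
            question = line[len("Question:"):].strip()
        elif line.startswith("Answer:") and question is not None:
            answer = line[len("Answer:"):].strip()
            records.append({
                "text": answer,                      # text used for synthesis
                "metadata": {"question": question},  # key used for retrieval
            })
            question = None
    return records

# A simple prompt template (practice 7) keeps input/output format consistent.
QA_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Answer the question using only the context above.\n"
    "Question: {query}\nAnswer:"
)

sample = """\
Question: What license does llama_index use?
Answer: It is released under the MIT License.

Question: Where are the docs hosted?
Answer: At https://docs.llamaindex.ai.
"""

records = parse_qa_pairs(sample)
prompt = QA_TEMPLATE.format(
    context=records[0]["text"],
    query=records[0]["metadata"]["question"],
)
```

With `llama-index-core` installed, each record would typically become a `Document(text=rec["text"], metadata=rec["metadata"])` and be indexed with `VectorStoreIndex.from_documents(...)`, so retrieval can match on the stored question while synthesis uses the linked answer text.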

#### Details

**Sources**

- [docs/docs/use_cases/q_and_a/index.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/use_cases/q_and_a/index.md)
- [docs/docs/examples/llama_dataset/ragdataset_submission_template.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/llama_dataset/ragdataset_submission_template.ipynb)
- [docs/docs/optimizing/production_rag.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/optimizing/production_rag.md)
- [llama-index-legacy/llama_index/legacy/prompts/default_prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/prompts/default_prompts.py)
- [llama-index-core/llama_index/core/indices/document_summary/retrievers.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/document_summary/retrievers.py)
- [llama-index-core/llama_index/core/retrievers/__init__.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/retrievers/__init__.py)
- [docs/docs/understanding/putting_it_all_together/agents.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/understanding/putting_it_all_together/agents.md)
- [llama-index-core/llama_index/core/retrievers/router_retriever.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/retrievers/router_retriever.py)
- [docs/docs/module_guides/deploying/agents/index.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/module_guides/deploying/agents/index.md)
- [llama-index-legacy/llama_index/legacy/retrievers/router_retriever.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/retrievers/router_retriever.py)
- [llama-index-core/llama_index/core/query_engine/sub_question_query_engine.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/query_engine/sub_question_query_engine.py)
- [llama-index-core/llama_index/core/query_engine/__init__.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/query_engine/__init__.py)
- [llama-index-legacy/llama_index/legacy/query_engine/__init__.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/query_engine/__init__.py)
- [docs/docs/module_guides/models/prompts/index.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/module_guides/models/prompts/index.md)
- [llama-index-core/llama_index/core/prompts/prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/prompts/prompts.py)
- [llama-index-core/llama_index/core/prompts/default_prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/prompts/default_prompts.py)
- [llama-index-core/tests/mock_utils/mock_prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/tests/mock_utils/mock_prompts.py)
- [llama-index-legacy/llama_index/legacy/prompts/prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/prompts/prompts.py)
- [llama-index-core/llama_index/core/question_gen/output_parser.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/question_gen/output_parser.py)
- [llama-index-legacy/llama_index/legacy/question_gen/prompts.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/question_gen/prompts.py)
- [llama-index-legacy/llama_index/legacy/question_gen/output_parser.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/question_gen/output_parser.py)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.