janhq / jan

Jan is an open source alternative to ChatGPT that runs 100% offline on your computer. Multiple engine support (llama.cpp, TensorRT-LLM)
https://jan.ai/
GNU Affero General Public License v3.0

feat: Adding chunk ids for grounded RAG #3396

Closed bhugueney closed 1 week ago

bhugueney commented 4 weeks ago

Is your feature request related to a problem? Please describe it

I'm worried about the reliability of the generated answer to a RAG question.

Describe the solution

I want to be able to ground the RAG-generated answer with references to the chunks that support it, using for instance the Nous Hermes 3 fine-tunes of Llama 3.1. The expected prompt format is as follows (fig. 8 of the linked document):

System: You are a conversational AI assistant that is provided a list of documents and a user query to answer based on information from the documents. You should always use grounded information in your responses, only answering from what you can cite in the documents. Cite all facts from the documents using tags.

User: What does Agent Instruct do?

CONTEXT:
Document:0
Title: AgentInstruct: Toward Generative Teaching with Agentic Flows
Text: Synthetic data is becoming increasingly important for accelerating the development of language models ...

Document:1
Title: Advances in Emergency Medical Technology
Text: In recent years, emergency medical technology has seen significant advancements ...

Document:2
Title: The Rise of Agentic Data Generation
Text: AgentInstruct is an agentic framework by Mitra et al. (2024), designed to generate large-scale ...

This only requires the id and title of the source document to be provided in the context for each chunk.
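A minimal sketch of what that serialization could look like; the `RetrievedChunk` shape and `buildGroundedContext` helper are hypothetical illustrations, not existing Jan APIs:

```typescript
// Hypothetical chunk shape; field names are assumptions, not Jan's actual retrieval types.
interface RetrievedChunk {
  docId: number;    // stable id of the source document
  docTitle: string; // title of the source document
  text: string;     // the chunk text itself
}

// Serialize each chunk with its id and title so the model can cite it.
function buildGroundedContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map((c) => `Document:${c.docId}\nTitle: ${c.docTitle}\nText: ${c.text}`)
    .join("\n\n");
}

// Example usage (prompt assembly shown for illustration only):
// const context = buildGroundedContext(retrievedChunks);
// const prompt = `CONTEXT:\n${context}\n\nQuestion: ${userQuery}`;
```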

Teachability, documentation, adoption, migration strategy

Either a toggle for adding the ids and document titles, or a more detailed Retrieval Template format that lets users specify them. Such a GUI option or Retrieval Template format could easily be added to the RAG documentation.
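One way the more detailed template could look, sketched as a per-chunk sub-template alongside the overall retrieval prompt; the placeholder names (`{CHUNK_ID}`, `{DOC_TITLE}`, `{CHUNK_TEXT}`, `{CONTEXT}`, `{QUESTION}`) are illustrative assumptions, not Jan's current template variables:

```typescript
// Hypothetical extension of the Retrieval Template: a per-chunk sub-template
// rendered once per retrieved chunk, in addition to the overall prompt template.
const chunkTemplate = "Document:{CHUNK_ID}\nTitle: {DOC_TITLE}\nText: {CHUNK_TEXT}";

const retrievalTemplate =
  "Use only the documents below and cite the ids of the documents you rely on.\n" +
  "CONTEXT:\n{CONTEXT}\n\nQuestion: {QUESTION}";

// {CONTEXT} would be built by filling chunkTemplate for each retrieved chunk
// and joining the results, so every chunk carries a citable id and title.
```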

What is the motivation / use case for changing the behavior?

With the correct model, such a RAG context enables grounded generation with citations, as in the following example:

Assistant: AgentInstruct is an agentic framework designed to generate large-scale, diverse, and high-quality synthetic data. The framework uses a sophisticated pipeline that transforms raw text into refined instructions through multiple stages of processing. The pipeline consists of four main steps: Seed Collection, Content Transformation, Seed Instruction Generation, and Instruction Refinement. Each stage in the pipeline is designed to enhance the complexity and quality of the generated instructions, ultimately creating a dataset that can be used to fine-tune language models.
Cited Documents: 2
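The trailing "Cited Documents" line could then be mapped back to the chunks that were sent, so the UI can show the actual sources. A rough sketch of that post-processing, assuming the citation line format shown above (the helper is hypothetical):

```typescript
// Hypothetical post-processing: pull the document ids out of the
// "Cited Documents: ..." line so they can be looked up in the RetrievedChunk
// list from the earlier sketch and displayed as sources.
function extractCitedDocIds(answer: string): number[] {
  const match = answer.match(/Cited Documents:\s*([\d,\s]+)/);
  if (!match) return [];
  return match[1]
    .split(",")
    .map((s) => parseInt(s.trim(), 10))
    .filter((n) => !Number.isNaN(n));
}

// e.g. extractCitedDocIds(answer) -> [2] for the example answer above.
```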

0xSage commented 1 week ago

Closing as dupe of #3550