run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.25k stars · 4.95k forks

[Question]: How does `RagDatasetGenerator` work? #14974

Open EssamWisam opened 1 month ago

EssamWisam commented 1 month ago

Question Validation

Question

I understand what it does, and there is a helpful guide on it in the documentation. However, I could not find any hint of how it is actually implemented (e.g., a reference paper).

This is important because there are already other frameworks, such as DeepEval and Ragas, that offer the same feature with more transparency about how it works, but they integrate poorly with LlamaIndex in my application.

dosubot[bot] commented 1 month ago

RagDatasetGenerator is a class designed to generate datasets consisting of questions or question-answer pairs based on given documents. Here is how it works:

  1. Initialization:

    • The class is initialized with a list of nodes; an optional service context; the number of questions to generate per chunk; templates for question generation and question–answer generation; a query for generating questions; a metadata mode; a progress flag; and the number of workers.
    • If the service context is not provided, it is obtained from defaults.
    def __init__(
        self,
        nodes: List[BaseNode],
        service_context: ServiceContext | None = None,
        num_questions_per_chunk: int = 3,
        text_question_template: BasePromptTemplate | None = None,
        text_qa_template: BasePromptTemplate | None = None,
        question_gen_query: str | None = None,
        metadata_mode: MetadataMode = MetadataMode.NONE,
        show_progress: bool = False,
        workers: int = DEFAULT_NUM_WORKERS,
    ) -> None:
        # Initialization code
  2. Dataset Generation from Documents:

    • The from_documents class method generates a dataset from a list of documents. It applies transformations to the documents, filters nodes based on keywords, and initializes the RagDatasetGenerator with the processed nodes.
    @classmethod
    def from_documents(
        cls,
        documents: List[Document],
        service_context: ServiceContext | None = None,
        num_questions_per_chunk: int = 3,
        text_question_template: BasePromptTemplate | None = None,
        text_qa_template: BasePromptTemplate | None = None,
        question_gen_query: str | None = None,
        required_keywords: List[str] | None = None,
        exclude_keywords: List[str] | None = None,
        show_progress: bool = False,
        workers: int = DEFAULT_NUM_WORKERS,
    ) -> RagDatasetGenerator:
        # Method code
  3. Asynchronous Dataset Generation:

    • The _agenerate_dataset method asynchronously generates questions for each node and, when building a labelled dataset, answers as well. It builds a query engine over each node, collects the LLM responses, and turns them into LabelledRagDataExample instances.
    async def _agenerate_dataset(
        self,
        nodes: List[BaseNode],
        labelled: bool = False,
    ) -> LabelledRagDataset:
        # Method code
  4. Synchronous Dataset Generation:

    • The generate_questions_from_nodes and generate_dataset_from_nodes methods provide synchronous interfaces to generate questions and question-answer pairs, respectively, by running the asynchronous methods.
    def generate_questions_from_nodes(self) -> LabelledRagDataset:
        return asyncio.run(self.agenerate_questions_from_nodes())
    
    def generate_dataset_from_nodes(self) -> LabelledRagDataset:
        return asyncio.run(self.agenerate_dataset_from_nodes())
  5. Prompt Management:

    • The class includes methods to get and update prompt templates used for question and question-answer generation.
    def _get_prompts(self) -> PromptDictType:
        return {
            "text_question_template": self.text_question_template,
            "text_qa_template": self.text_qa_template,
        }
    
    def _update_prompts(self, prompts: PromptDictType) -> None:
        if "text_question_template" in prompts:
            self.text_question_template = prompts["text_question_template"]
        if "text_qa_template" in prompts:
            self.text_qa_template = prompts["text_qa_template"]
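
The asynchronous fan-out in step 3 can be sketched without LlamaIndex at all: a worker-bounded asyncio pattern. This is an illustrative sketch, not the library's code; `fake_generate_questions` stands in for the real LLM call.

```python
import asyncio

async def fake_generate_questions(node_text: str) -> list[str]:
    # Stand-in for an LLM call; the real class sends a question-generation
    # prompt containing the node's text to the configured LLM.
    await asyncio.sleep(0)
    return [f"Question about: {node_text}"]

async def generate_dataset(nodes: list[str], workers: int = 2) -> list[str]:
    # Bound concurrency with a semaphore, mirroring the `workers` argument.
    sem = asyncio.Semaphore(workers)

    async def for_node(text: str) -> list[str]:
        async with sem:
            return await fake_generate_questions(text)

    # gather() preserves input order, so questions line up with their nodes.
    results = await asyncio.gather(*(for_node(n) for n in nodes))
    # Flatten the per-node question lists into one dataset.
    return [q for qs in results for q in qs]

questions = asyncio.run(generate_dataset(["chunk A", "chunk B"]))
```

The synchronous methods in step 4 are exactly this kind of `asyncio.run` bridge over the async implementation.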

In summary, RagDatasetGenerator is a flexible tool for generating evaluation datasets from documents, leveraging language models to create diverse and contextually relevant questions and answers. Note that this is a beta feature and is subject to change.

For example, you can use RagDatasetGenerator to create a synthetic LabelledRagDataset over Wikipedia articles. By using a language model like GPT-4, you can generate both the queries and reference answers for the dataset. This allows you to benchmark and evaluate different RAG pipeline configurations effectively. The generated dataset can be serialized to JSON for persistence and later reloaded for further analysis [1][2][3].
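
The persistence step amounts to a JSON round trip. A minimal sketch with plain dicts (the real dataset exposes richer LabelledRagDataExample objects; the field names below are illustrative):

```python
import json

# Illustrative examples; real LabelledRagDataExample objects carry more
# fields than the query and reference answer shown here.
examples = [
    {"query": "What is RAG?", "reference_answer": "Retrieval-augmented generation."},
]

# Serialize the dataset to JSON for persistence...
payload = json.dumps({"examples": examples}, indent=2)

# ...and reload it later for further analysis.
reloaded = json.loads(payload)["examples"]
```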

To continue talking to Dosu, mention @dosu.

EssamWisam commented 1 month ago

This does not explain the dynamics of how it works, or whether it implements an approach previously proposed in a paper.

dosubot[bot] commented 1 month ago

The RagDatasetGenerator implements several specific methodologies:

  1. Question Generation: It uses a prompt-based approach to generate questions from document chunks. The default prompt for question generation is defined as DEFAULT_QUESTION_GENERATION_PROMPT.

  2. Node Postprocessing: It uses a KeywordNodePostprocessor to filter nodes based on required and excluded keywords.

  3. Transformation Pipeline: It applies transformations to documents to generate nodes, which are then used for question generation.
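
Taken together, the three pieces above reduce to: split documents into chunks, drop chunks that fail keyword checks, and format a question-generation prompt for each surviving chunk. A self-contained sketch (the prompt wording is a paraphrase, not the library's exact default template):

```python
def filter_nodes(nodes, required=None, excluded=None):
    # Mirrors the KeywordNodePostprocessor idea: keep a chunk only if it
    # contains every required keyword and none of the excluded ones.
    kept = []
    for text in nodes:
        if required and not all(k in text for k in required):
            continue
        if excluded and any(k in text for k in excluded):
            continue
        kept.append(text)
    return kept

# Illustrative paraphrase of a question-generation template.
PROMPT = (
    "Context information is below.\n{context}\n"
    "Given the context and not prior knowledge, generate {n} questions."
)

chunks = ["LLMs and retrieval.", "Cooking recipes."]   # stand-ins for parsed nodes
nodes = filter_nodes(chunks, required=["retrieval"])
prompts = [PROMPT.format(context=c, n=3) for c in nodes]
```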

However, there are no explicit references to research papers or theoretical frameworks within the provided code or documentation [1][2].


ArjunR27 commented 3 weeks ago

@dosu what does it cost to run RagDatasetGenerator with gpt-4?