Open EssamWisam opened 1 month ago
`RagDatasetGenerator` is a class designed to generate datasets consisting of questions or question-answer pairs based on given documents. Here is how it works:
Initialization:

```python
def __init__(
    self,
    nodes: List[BaseNode],
    service_context: ServiceContext | None = None,
    num_questions_per_chunk: int = 3,
    text_question_template: BasePromptTemplate | None = None,
    text_qa_template: BasePromptTemplate | None = None,
    question_gen_query: str | None = None,
    metadata_mode: MetadataMode = MetadataMode.NONE,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> None:
    # Initialization code
```
Dataset Generation from Documents:

The `from_documents` class method generates a dataset from a list of documents. It applies transformations to the documents, filters nodes based on keywords, and initializes the `RagDatasetGenerator` with the processed nodes.

```python
@classmethod
def from_documents(
    cls,
    documents: List[Document],
    service_context: ServiceContext | None = None,
    num_questions_per_chunk: int = 3,
    text_question_template: BasePromptTemplate | None = None,
    text_qa_template: BasePromptTemplate | None = None,
    question_gen_query: str | None = None,
    required_keywords: List[str] | None = None,
    exclude_keywords: List[str] | None = None,
    show_progress: bool = False,
    workers: int = DEFAULT_NUM_WORKERS,
) -> RagDatasetGenerator:
    # Method code
```
Asynchronous Dataset Generation:

The `_agenerate_dataset` method generates questions (and optionally answers) asynchronously for each node. It uses a query engine to generate questions and, when labelled, answers, then processes the responses into `LabelledRagDataExample` instances.

```python
async def _agenerate_dataset(
    self,
    nodes: List[BaseNode],
    labelled: bool = False,
) -> LabelledRagDataset:
    # Method code
```
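The fan-out structure of that method — issue one LLM call per node concurrently, then optionally a second round of calls to answer the generated questions — can be sketched with `asyncio.gather`. Everything here (`fake_query`, `agenerate_dataset`, the dict layout) is a hypothetical illustration, not the llama_index implementation:

```python
import asyncio
from typing import Dict, List


async def fake_query(prompt: str) -> str:
    # Stand-in for an async LLM / query-engine call.
    await asyncio.sleep(0)
    return f"answer to: {prompt}"


async def agenerate_dataset(
    chunks: List[str], labelled: bool = False
) -> List[Dict[str, str]]:
    # Generate questions for all chunks concurrently, mirroring how
    # _agenerate_dataset fans out one task per node to its workers.
    questions = await asyncio.gather(
        *(fake_query(f"Write a question about: {c}") for c in chunks)
    )
    examples = [{"query": q, "context": c} for q, c in zip(questions, chunks)]
    if labelled:
        # Optionally answer each generated question to label the example.
        answers = await asyncio.gather(*(fake_query(q) for q in questions))
        for ex, a in zip(examples, answers):
            ex["reference_answer"] = a
    return examples
```

Running the calls concurrently rather than sequentially is what makes the `workers` parameter matter: wall-clock time is bounded by the slowest batch of LLM calls, not their sum.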
Synchronous Dataset Generation:

The `generate_questions_from_nodes` and `generate_dataset_from_nodes` methods provide synchronous interfaces for generating questions and question-answer pairs, respectively, by running the asynchronous methods:

```python
def generate_questions_from_nodes(self) -> LabelledRagDataset:
    return asyncio.run(self.agenerate_questions_from_nodes())

def generate_dataset_from_nodes(self) -> LabelledRagDataset:
    return asyncio.run(self.agenerate_dataset_from_nodes())
```
Prompt Management:

The `_get_prompts` and `_update_prompts` methods expose the question and QA templates so they can be inspected or overridden:

```python
def _get_prompts(self) -> PromptDictType:
    return {
        "text_question_template": self.text_question_template,
        "text_qa_template": self.text_qa_template,
    }

def _update_prompts(self, prompts: PromptDictType) -> None:
    if "text_question_template" in prompts:
        self.text_question_template = prompts["text_question_template"]
    if "text_qa_template" in prompts:
        self.text_qa_template = prompts["text_qa_template"]
```
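A minimal sketch of how that get/update prompt pattern behaves — `PromptTemplate` and `PromptMixin` below are hypothetical stand-ins (the real `BasePromptTemplate` is richer than a format string):

```python
class PromptTemplate:
    """Minimal stand-in for BasePromptTemplate: wraps a Python format string."""

    def __init__(self, template: str):
        self.template = template

    def format(self, **kwargs) -> str:
        return self.template.format(**kwargs)


class PromptMixin:
    """Hypothetical holder mirroring the prompt-management methods above."""

    def __init__(self):
        self.text_question_template = PromptTemplate(
            "Generate {num} questions about the context below:\n{context}"
        )
        self.text_qa_template = PromptTemplate(
            "Answer using only this context:\n{context}\nQuestion: {query}"
        )

    def _get_prompts(self):
        return {
            "text_question_template": self.text_question_template,
            "text_qa_template": self.text_qa_template,
        }

    def _update_prompts(self, prompts):
        # Swap in caller-supplied templates by key, leaving others untouched.
        if "text_question_template" in prompts:
            self.text_question_template = prompts["text_question_template"]
        if "text_qa_template" in prompts:
            self.text_qa_template = prompts["text_qa_template"]
```

In practice this is how you would steer the style, language, or difficulty of the generated questions without subclassing the generator.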
In summary, `RagDatasetGenerator` is a flexible tool for generating educational datasets from documents, leveraging language models to create diverse and contextually relevant questions and answers. Note that this is a beta feature and is subject to change.
For example, you can use `RagDatasetGenerator` to create a synthetic `LabelledRagDataset` over Wikipedia articles. By using a language model like GPT-4, you can generate both the queries and reference answers for the dataset, which lets you benchmark and evaluate different RAG pipeline configurations. The generated dataset can be serialized to JSON for persistence and later reloaded for further analysis [1][2][3].
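The persistence step can be sketched with plain JSON. This is an illustrative stand-in: the real `LabelledRagDataset` defines its own schema and persistence helpers, and the field layout below is assumed:

```python
import json
from typing import Dict, List


def save_dataset(examples: List[Dict[str, str]], path: str) -> None:
    # Persist generated examples as JSON (illustrative field layout).
    with open(path, "w") as f:
        json.dump({"examples": examples}, f)


def load_dataset(path: str) -> List[Dict[str, str]]:
    # Reload the serialized dataset for later analysis.
    with open(path) as f:
        return json.load(f)["examples"]
```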
To continue talking to Dosu, mention @dosu.
This does not explain the dynamics of how it works or whether it implements an approach previously proposed in a paper.
The `RagDatasetGenerator` implements several specific methodologies:

- Question Generation: It uses a prompt-based approach to generate questions from document chunks. The default prompt for question generation is defined as `DEFAULT_QUESTION_GENERATION_PROMPT`.
- Node Postprocessing: It uses a `KeywordNodePostprocessor` to filter nodes based on required and excluded keywords.
- Transformation Pipeline: It applies transformations to documents to generate nodes, which are then used for question generation.

However, there are no explicit references to research papers or theoretical frameworks within the provided code or documentation [1][2].
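The core of the prompt-based question generation step is just template formatting per chunk. The template text below is a paraphrase, not the exact wording of `DEFAULT_QUESTION_GENERATION_PROMPT`, and `build_question_prompt` is a hypothetical helper:

```python
# Paraphrased sketch of the question-generation prompt; the exact wording
# of DEFAULT_QUESTION_GENERATION_PROMPT lives in the llama_index source.
QUESTION_GEN_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and no prior knowledge, "
    "generate {num} questions a teacher might ask on an upcoming quiz."
)


def build_question_prompt(chunk: str, num_questions: int = 3) -> str:
    # Each document chunk is formatted into the template before being
    # sent to the LLM, one call per chunk.
    return QUESTION_GEN_TEMPLATE.format(context_str=chunk, num=num_questions)
```

So the "methodology" is essentially structured prompting over chunks rather than an algorithm from a specific paper, which is consistent with the lack of citations in the code.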
@dosu what does it cost for the RagDatasetGenerator to run for gpt-4?
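A back-of-the-envelope estimate follows from the call pattern above: one input pass per chunk for question generation, plus (if labelled) one pass per generated question for answers. All numbers below — token counts per chunk/question/answer and the per-1K-token GPT-4 prices — are assumptions to be replaced with current values:

```python
def estimate_gpt4_cost(
    num_chunks: int,
    tokens_per_chunk: int = 512,
    questions_per_chunk: int = 3,
    tokens_per_question: int = 20,
    tokens_per_answer: int = 150,
    input_price_per_1k: float = 0.03,   # assumed GPT-4 rate; check current pricing
    output_price_per_1k: float = 0.06,  # assumed GPT-4 rate; check current pricing
    labelled: bool = True,
) -> float:
    # Question generation: each chunk is sent once as input; the model
    # emits questions_per_chunk questions as output.
    input_tokens = num_chunks * tokens_per_chunk
    output_tokens = num_chunks * questions_per_chunk * tokens_per_question
    if labelled:
        # Answer generation: each (chunk, question) pair is sent again,
        # and an answer is emitted per question.
        total_questions = num_chunks * questions_per_chunk
        input_tokens += total_questions * (tokens_per_chunk + tokens_per_question)
        output_tokens += total_questions * tokens_per_answer
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
```

Under these assumptions, 100 chunks with answers comes to roughly $9; generating questions only (no reference answers) is several times cheaper because the chunks are not re-sent for answering.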
Question Validation
Question
I understand what it does, and there is a helpful guide on it in the documentation. However, I could find no hint anywhere of how it's actually implemented (e.g., a reference paper).
This is important because there are already other frameworks, like DeepEval and Ragas, that offer the same feature (but with more transparency about how it works), yet they integrate poorly with LlamaIndex in my application.