run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

RagDatasetGenerator not generating questions like DatasetGenerator does [Bug]: #10997

Closed: vecorro closed this issue 7 months ago

vecorro commented 7 months ago

Bug Description

I'm trying to reproduce the code from the question generation example. Even in the LlamaIndex v0.10.7 documentation, that example leverages the deprecated DatasetGenerator instead of RagDatasetGenerator.

The problem is that with DatasetGenerator I get a set of questions very similar to the example's expected output, whereas with RagDatasetGenerator I just get a bunch of text and no questions at all.

Version

v0.10.7

Steps to Reproduce

Code based on DatasetGenerator, which shows the correct/expected behavior:

%%time
# Imports assumed for LlamaIndex v0.10.x
import nest_asyncio
from llama_index.core import SimpleDirectoryReader
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai_like import OpenAILike

nest_asyncio.apply()

# LLM served through an OpenAI-compatible vLLM endpoint
llm = OpenAILike(
    model="meta-llama/Llama-2-70b-chat-hf",
    is_chat_model=True,
    api_key="my-key",
    api_base="https://vllm-host/api/v1",
)

# Load the Paul Graham essay used in the docs example
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data(show_progress=True)

# Build the (deprecated) generator and have it write questions
data_generator = DatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,
)

eval_questions = data_generator.generate_questions_from_nodes()

Output sample (353 questions):

['Sure, here are 10 questions based on the provided context information:',
 'In the essay, what does Paul Graham describe as his first exposure to programming?',
 "What type of computer did Paul Graham's father buy in 1980?",
 "What was Paul Graham's initial plan for his college studies?",
...

However, if I run the same code with the new RagDatasetGenerator class, I don't get any questions:

%%time
# Imports assumed for LlamaIndex v0.10.x
import nest_asyncio
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai_like import OpenAILike

nest_asyncio.apply()

# Same vLLM endpoint as above (credentials redacted)
llm = OpenAILike(
    model="meta-llama/Llama-2-70b-chat-hf",
    is_chat_model=True,
    api_key="my-key",
    api_base="https://vllm-host/api/v1",
)

reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data(show_progress=True)

# Build the replacement generator with the same settings
data_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,
)

eval_questions = data_generator.generate_questions_from_nodes()

Output:

es, theory and systems. The theory people prove things, and the systems people build things. I wanted to build things.'], reference_answer='', reference_answer_by=None), LabelledRagDataExample(query='What was the name of the programming language that the author learned to teach himself AI?', query_by=CreatedBy(model_name='meta-llama/Llama-2-70b-chat-hf', type=<CreatedByType.AI: 'ai'>), reference_contexts=['I couldn\'t have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.\n\nAI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven\'t tried rereading The Moon is a Harsh Mistress, ...
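
Looking closely at that output, each LabelledRagDataExample does carry a generated question in its query field, so the questions may simply be wrapped inside the dataset object rather than missing. Assuming generate_questions_from_nodes returns a LabelledRagDataset here, as the printed repr suggests, the question strings could be pulled out like this:

# Sketch, not verified: assumes eval_questions is a LabelledRagDataset,
# as the LabelledRagDataExample reprs above suggest.
questions = [example.query for example in eval_questions.examples]
print(len(questions), questions[:3])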

Thanks in advance for your help!

Relevant Logs/Tracebacks

ALREADY PROVIDED IN THE PREVIOUS BOX
dosubot[bot] commented 7 months ago

🤖

Hey @vecorro! Great to see you back here, diving into the depths of question generation 🤿

Based on the information you've provided and the context from the LlamaIndex repository, it seems that the RagDatasetGenerator and the deprecated DatasetGenerator use different methods to generate questions.

The RagDatasetGenerator uses a language model (LLM) to generate questions based on the given documents. It does this by creating a query_engine from the SummaryIndex of each document, and then using this query_engine to generate questions based on the question_gen_query provided. This question_gen_query is a prompt that instructs the LLM to generate questions. By default, it is set to:

"You are a Teacher/Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."

This may differ from the deprecated DatasetGenerator. Without that class's implementation in front of me, it's hard to give a precise comparison, but the key difference likely lies in how each generator assembles its prompts and returns the generated questions.

You can find more details about this in the llama-index-core/llama_index/core/llama_dataset/generator.py file.

As for why you're not getting questions with RagDatasetGenerator, it's hard to say without more information. It could be due to the specific documents you're using, the settings of the LLM, or other factors. If you could provide more details about the documents and the settings of the LLM, it might be easier to diagnose the issue.

aniamer commented 1 month ago

What was the resolution? I can't find a reference to one, @vecorro.