run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: data_generator.generate_questions_from_nodes() does not generate questions. Progress bar does not move #12859

Closed Luchao2012 closed 3 months ago

Luchao2012 commented 7 months ago

Bug Description

data_generator.generate_questions_from_nodes() does not generate questions. Progress bar does not move

Version

0.10.27

Steps to Reproduce

```python
from llama_index.core import Settings, SimpleDirectoryReader
from llama_index.core.evaluation import DatasetGenerator
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name="test-gpt-35-turbo",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

# You need to deploy your own embedding model as well as your own chat completion model
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="test-text-embedding-ada-002",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

Settings.llm = llm
Settings.embed_model = embed_model

reader = SimpleDirectoryReader(
    input_dir="../../reports_reservoir_sub",
    recursive=True,
)
docs = reader.load_data(num_workers=4)  # Load the documents from the directory

data_generator = DatasetGenerator.from_documents(
    docs, num_questions_per_chunk=1, llm=llm, show_progress=True
)
eval_questions = data_generator.generate_questions_from_nodes()
```

The code hangs on the last line and never progresses.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 7 months ago

Hey @Luchao2012, great to see you tackling new challenges with LlamaIndex! 👋 It's like we're on a tech adventure together. I'm diving into your issue right now and will be back with insights shortly. Stay tuned! 🕵️‍♂️

logan-markewich commented 7 months ago

Maybe confirm your llm works

print(llm.complete("hello world"))

If it works, try hitting ctrl-c when it's "stuck" and see where it is in the code from the traceback
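Besides Ctrl-C, Python's standard-library `faulthandler` module can dump the stack of every thread in a process that appears hung, which shows exactly where execution is stuck. This is a general debugging sketch, not llama_index-specific:

```python
import faulthandler
import sys

# Print the current stack of every thread to stderr without
# interrupting the process; call this from a signal handler or a
# second thread while the program appears hung.
faulthandler.dump_traceback(file=sys.stderr)

# Alternatively, arm a watchdog before the suspect call: if the
# process is still running after 60 seconds, the tracebacks of all
# threads are dumped automatically.
faulthandler.dump_traceback_later(60, exit=False)
faulthandler.cancel_dump_traceback_later()  # disarm once the call returns
```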

Luchao2012 commented 7 months ago

print(llm.complete("hello world")) shows: Hello! How can I assist you today?

All the code works until the last line, where it gets "stuck": eval_questions = data_generator.generate_questions_from_nodes()


dosubot[bot] commented 7 months ago

To address the issue where eval_questions = data_generator.generate_questions_from_nodes() hangs without progress, consider the following steps:

  1. Check for Asyncio Loop Conflicts: If your application already has an event loop running, use agenerate_questions_from_nodes() within your async function instead of generate_questions_from_nodes().

  2. Verify Azure Service Configuration: Ensure that your AzureOpenAI and AzureOpenAIEmbedding configurations are correct, including credentials and endpoints.

  3. Inspect API Rate Limits: You might be hitting Azure's rate limits. Check the Azure dashboard for any rate limit warnings.

  4. Debug Async Jobs Runner: Investigate the run_jobs function for issues in task dispatch or completion, especially if it's related to the progress bar or async task management.

  5. Validate Input Nodes: Make sure the data passed to generate_questions_from_nodes() is correctly formatted and valid.

  6. Handle Exceptions in Async Tasks: Add exception handling around async tasks to catch and log errors that might not be visible.

  7. Check Model Compatibility: Confirm that the models gpt-3.5-turbo and text-embedding-ada-002 are being used correctly according to their input and output requirements.

  8. Update Libraries: Ensure all dependencies, including Azure SDKs, are up to date.

  9. Enable Detailed Logging: Increase logging levels to gain more insight into the operation of the function and Azure services.

  10. Simplify and Manually Test: Try a simplified manual test with the Azure AI services to isolate the issue.
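Step 1 above (an asyncio loop conflict) is a common cause of this symptom in notebooks: a synchronous wrapper that internally starts its own event loop cannot run while a loop is already running. A minimal stdlib-only sketch of the failure mode and the fix — `generate`/`generate_sync` are hypothetical stand-ins, not llama_index APIs:

```python
import asyncio

async def generate():
    """Stand-in for an async question generator (hypothetical)."""
    await asyncio.sleep(0)
    return ["question"]

def generate_sync():
    """Sync wrapper that internally starts its own event loop."""
    return asyncio.run(generate())

async def caller():
    # Inside an already-running loop (e.g. a Jupyter cell), the sync
    # wrapper cannot start a second loop and raises RuntimeError.
    try:
        return generate_sync()
    except RuntimeError:
        # Correct approach: await the async variant directly.
        return await generate()

print(asyncio.run(caller()))  # → ['question']
```

In a notebook, this is why calling the sync API can hang or fail while `await`-ing the async variant works.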

Regarding the internal error handling of generate_questions_from_nodes(), it primarily relies on the async function it wraps. Any exceptions occurring in the async tasks need to be caught by the caller or through global exception handling in your application.

These steps should help you troubleshoot and resolve the issue with the function hanging. If the problem persists, consider reaching out to Azure support with detailed logs and a description of your troubleshooting steps.

Luchao2012 commented 7 months ago

I changed it to RagDatasetGenerator:

```python
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

dataset_generator = RagDatasetGenerator.from_documents(
    docs, num_questions_per_chunk=1, llm=llm, show_progress=True
)
rag_dataset = dataset_generator.generate_dataset_from_nodes()
```

There is progress now. Why does DatasetGenerator not progress when all the other code is the same?

logan-markewich commented 7 months ago

@Luchao2012 and if you hit ctrl+c (or kill/interrupt the program), what's the traceback?

It works fine for me locally, so just trying to figure out where the issue is

If you are running in a notebook (or some other async context), maybe try the async version?

```python
eval_questions = await data_generator.agenerate_questions_from_nodes()
```