microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: skip_workflows is not working #793

Closed CHENGLING-QIU closed 4 months ago

CHENGLING-QIU commented 4 months ago


Describe the bug

The skip_workflows setting doesn't work as expected; an internal function fails with a missing "strategy" input.

I hope to skip create_base_text_units & create_base_extracted_entities, as they were already completed in a previous run. The input file is very large, so those steps took ~5 hrs, and I don't want to rerun the process.

However, the skip_workflows setting is not working at all.

The rest of the configuration should be set up correctly, as I have run the pipeline several times on smaller text inputs.

Steps to reproduce

Add the workflow names to the skip_workflows setting in settings.yaml:

skip_workflows: [create_base_text_units, create_base_extracted_entities]

Expected Behavior

The listed workflows are skipped and their existing parquet outputs are read directly, so the downstream verbs can continue.

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: [create_base_text_units, create_base_extracted_entities]

Logs and screenshots

09:17:15,803 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_text_units.parquet
09:17:16,818 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
09:17:16,818 graphrag.index.run INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
09:17:16,819 graphrag.index.run INFO read table from storage: create_base_text_units.parquet
09:17:17,311 datashaper.workflow.workflow INFO executing verb entity_extract
09:17:17,311 datashaper.workflow.workflow ERROR Error executing verb "entity_extract" in create_base_extracted_entities: entity_extract() missing 1 required positional argument: 'strategy'
Traceback (most recent call last):
  File "path_to_your_project_site_packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: entity_extract() missing 1 required positional argument: 'strategy'
09:17:17,321 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "entity_extract" in create_base_extracted_entities: entity_extract() missing 1 required positional argument: 'strategy' details=None
09:17:17,321 graphrag.index.run ERROR error running workflow create_base_extracted_entities
Traceback (most recent call last):
  File "path_to_your_project_site_packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path_to_your_project_site_packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path_to_your_project_site_packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: entity_extract() missing 1 required positional argument: 'strategy'
09:17:17,340 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

Additional Information

9prodhi commented 4 months ago

The default data pipeline proceeds in the following order:

# Default Data Pipeline
pipeline = [
    'create_base_text_units',
    'create_base_extracted_entities',
    'create_summarized_entities',
    'create_base_entity_graph',
    'create_final_entities',
    'create_final_nodes',
    'create_final_communities',
    'join_text_units_to_entity_ids',
    'create_final_relationships',
    'join_text_units_to_relationship_ids',
    'create_final_community_reports',
    'create_final_text_units',
    'create_base_documents',
    'create_final_documents'
]

The create_summarized_entities step (3rd in the pipeline) uses the output from the previous step (create_base_extracted_entities) as input. Here's the relevant code snippet:

return [
    {
        "verb": "summarize_descriptions",
        "args": {
            **summarize_descriptions_config,
            "column": "entity_graph",
            "to": "entity_graph",
            "async_mode": summarize_descriptions_config.get(
                "async_mode", AsyncType.AsyncIO
            ),
        },
        "input": {"source": "workflow:create_base_extracted_entities"},
    },
    {
        "verb": "snapshot_rows",
        "enabled": graphml_snapshot_enabled,
        "args": {
            "base_name": "summarized_graph",
            "column": "entity_graph",
            "formats": [{"format": "text", "extension": "graphml"}],
        },
    },
]

If any steps are skipped, their outputs may be missing for the downstream steps that depend on them, potentially causing failures like this one. This conclusion is drawn from a preliminary analysis.
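
To make the dependency problem concrete, here is a minimal, self-contained sketch. It is not graphrag's implementation, and every name in it is made up for illustration:

# Hypothetical dependency-ordered pipeline: each step consumes the table
# produced by the step it depends on.
def run_pipeline(steps, skip_workflows, tables):
    for name, dependency, fn in steps:
        if name in skip_workflows:
            # Skipped: nothing is computed or registered for this step,
            # unless its old output was preloaded into `tables`.
            continue
        source = tables.get(dependency) if dependency else None
        if dependency is not None and source is None:
            raise RuntimeError(
                f"{name} depends on {dependency}, which was skipped and never loaded"
            )
        tables[name] = fn(source)

steps = [
    ("create_base_text_units", None, lambda _: ["chunk-1", "chunk-2"]),
    ("create_base_extracted_entities", "create_base_text_units",
     lambda units: [f"entities({u})" for u in units]),
]

# Skipping the first step without preloading its previous output breaks the second.
try:
    run_pipeline(steps, skip_workflows={"create_base_text_units"}, tables={})
except RuntimeError as err:
    print(err)

In this toy model, skipping only works if the skipped step's earlier output (e.g. its parquet file) is loaded back into the table store first; whether graphrag does that for skip_workflows is exactly what this issue is about.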

CHENGLING-QIU commented 4 months ago

@9prodhi Thank you for the reply. My original understanding was: if I have already finished create_base_text_units and create_base_extracted_entities in a previous run, with their output files generated, but then hit an error in create_summarized_entities, I should be able to skip those two verbs in the new run and continue from the third verb.

I am asking because the first two verbs took me ~12 hrs to process.

I just wanted to bring this up. As you shared, it seems you do have to rerun the pipeline from the beginning, so skip_workflows cannot help in this case.

9prodhi commented 4 months ago

The skip_workflows setting may have other use cases, but if your goal is to resume the pipeline from where you left off, you should consider using the --resume option with the name of the in-progress output directory as its parameter.

Here’s a sample command:


python -m graphrag.index --root ./b793 --resume "20240801-180420"
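
For reference, and assuming the stock settings.yaml layout (storage under output/${timestamp}/artifacts, reports under output/${timestamp}/reports), the value passed to --resume is just the name of the timestamped run directory that already holds the completed parquet tables:

output/
└── 20240801-180420/
    ├── artifacts/
    │   ├── create_base_text_units.parquet
    │   ├── create_base_extracted_entities.parquet
    │   └── ...
    └── reports/

With --resume, the indexer continues inside that run directory instead of starting a fresh timestamped run, which is what lets it pick up where it left off.
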
CHENGLING-QIU commented 4 months ago

Thank you, truly helpful!