Closed by CHENGLING-QIU 4 months ago
The default data pipeline proceeds in the following order:
# Default Data Pipeline
pipeline = [
    'create_base_text_units',
    'create_base_extracted_entities',
    'create_summarized_entities',
    'create_base_entity_graph',
    'create_final_entities',
    'create_final_nodes',
    'create_final_communities',
    'join_text_units_to_entity_ids',
    'create_final_relationships',
    'join_text_units_to_relationship_ids',
    'create_final_community_reports',
    'create_final_text_units',
    'create_base_documents',
    'create_final_documents'
]
The create_summarized_entities step (3rd in the pipeline) uses the output from the previous step (create_base_extracted_entities) as input. Here's the relevant code snippet:
return [
    {
        "verb": "summarize_descriptions",
        "args": {
            **summarize_descriptions_config,
            "column": "entity_graph",
            "to": "entity_graph",
            "async_mode": summarize_descriptions_config.get(
                "async_mode", AsyncType.AsyncIO
            ),
        },
        "input": {"source": "workflow:create_base_extracted_entities"},
    },
    {
        "verb": "snapshot_rows",
        "enabled": graphml_snapshot_enabled,
        "args": {
            "base_name": "summarized_graph",
            "column": "entity_graph",
            "formats": [{"format": "text", "extension": "graphml"}],
        },
    },
]
If a step is skipped, its output is not produced, so any later step that takes that output as input may fail. This conclusion is based on a preliminary analysis only.
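To make the dependency argument concrete, here is a minimal sketch of how one could check which workflows would be left without input when others are skipped. It is not GraphRAG code: the `DEPENDENCIES` map and the `missing_inputs` helper are hypothetical, and only the two dependency edges shown (from the snippet above and from the log in the issue) are confirmed by this thread.

```python
# Hypothetical dependency map, listed in pipeline order. Only these two
# edges are confirmed by the snippet and the log quoted in this thread.
DEPENDENCIES = {
    "create_base_text_units": [],
    "create_base_extracted_entities": ["create_base_text_units"],
    "create_summarized_entities": ["create_base_extracted_entities"],
}


def missing_inputs(dependencies, skipped, available_outputs):
    """Return {workflow: [dependencies with no output]} for workflows that run.

    `skipped` holds workflow names that will not be executed;
    `available_outputs` holds names whose tables already exist on disk.
    """
    produced = set(available_outputs)
    problems = {}
    for name, deps in dependencies.items():  # dicts preserve insertion order
        if name in skipped:
            continue  # a skipped workflow emits nothing new
        missing = [dep for dep in deps if dep not in produced]
        if missing:
            problems[name] = missing
        else:
            produced.add(name)  # this workflow can run and emit its table
    return problems


if __name__ == "__main__":
    skipped = {"create_base_text_units", "create_base_extracted_entities"}
    # Nothing from an earlier run is visible: the third step has no input.
    print(missing_inputs(DEPENDENCIES, skipped, available_outputs=set()))
    # Earlier outputs are visible: nothing is missing.
    print(missing_inputs(DEPENDENCIES, skipped, available_outputs=skipped))
```

Under these assumptions, skipping is only safe when the skipped workflows' tables from an earlier run are actually visible to the downstream steps.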
@9prodhi Thank you for the reply. My original understanding was: if 'create_base_text_units' and 'create_base_extracted_entities' finished in a previous run and their output files were generated, but create_summarized_entities then hit an error, I should be able to skip those two verbs in the next attempt and continue from the 3rd verb.
I ask because the first two verbs took me ~12 hrs to process.
I just wanted to bring it up. As you shared, it seems the pipeline has to be rerun from the beginning, so skip_workflows cannot help in this case.
The skip_workflows setting may have other use cases, but if your goal is to resume the pipeline from where you left off, consider using the --resume option with the name of the inprogress directory as a parameter.
Here’s a sample command:
python -m graphrag.index --root ./b793 --resume "20240801-180420"
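Before resuming, it can help to check which workflow tables an earlier run already emitted. The sketch below is a hypothetical helper, not part of GraphRAG: it assumes each finished workflow leaves a file named `<workflow>.parquet` (as the "emitting parquet table create_base_text_units.parquet" log line in the issue suggests) inside a directory that you pass in yourself. The location of that directory is an assumption and depends on your storage settings.

```python
from pathlib import Path


def emitted_tables(artifacts_dir):
    """Return sorted workflow names that have a .parquet file in artifacts_dir."""
    return sorted(p.stem for p in Path(artifacts_dir).glob("*.parquet"))


def first_unfinished(pipeline, artifacts_dir):
    """Return the first workflow in `pipeline` with no emitted table, or None."""
    done = set(emitted_tables(artifacts_dir))
    for name in pipeline:
        if name not in done:
            return name
    return None


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_run.py <artifacts directory>
    pipeline = [
        "create_base_text_units",
        "create_base_extracted_entities",
        "create_summarized_entities",
    ]
    print(first_unfinished(pipeline, sys.argv[1]))
```

This only inspects files on disk. It does not tell you what --resume itself will reuse, so treat its result as a sanity check rather than a guarantee.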
Thank you, truly helpful!
Do you need to file an issue?
Describe the bug
The skip_workflows setting doesn't work as expected: the run fails because an internal function (entity_extract) is missing its "strategy" argument.
I want to skip create_base_text_units and create_base_extracted_entities because they completed in a previous run. Since the input file is very large, this took ~5 hrs, and I hope I don't have to rerun the process.
However, the skip_workflows setting is not working at all.
The other settings should be correct, as I have run the pipeline several times on smaller text inputs.
Steps to reproduce
Put the workflow names in the skip_workflows setting in settings.yaml:
skip_workflows: [create_base_text_units, create_base_extracted_entities]
Expected Behavior
Skip the listed workflows and read their existing parquet output directly, so that the downstream verbs can continue.
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: [create_base_text_units, create_base_extracted_entities]
Logs and screenshots
09:17:15,803 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_text_units.parquet
09:17:16,818 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
09:17:16,818 graphrag.index.run INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
09:17:16,819 graphrag.index.run INFO read table from storage: create_base_text_units.parquet
09:17:17,311 datashaper.workflow.workflow INFO executing verb entity_extract
09:17:17,311 datashaper.workflow.workflow ERROR Error executing verb "entity_extract" in create_base_extracted_entities: entity_extract() missing 1 required positional argument: 'strategy'
Traceback (most recent call last):
  File "path_to_your_project_site_packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: entity_extract() missing 1 required positional argument: 'strategy'
09:17:17,321 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "entity_extract" in create_base_extracted_entities: entity_extract() missing 1 required positional argument: 'strategy' details=None
09:17:17,321 graphrag.index.run ERROR error running workflow create_base_extracted_entities
Traceback (most recent call last):
  File "path_to_your_project_site_packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path_to_your_project_site_packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path_to_your_project_site_packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: entity_extract() missing 1 required positional argument: 'strategy'
09:17:17,340 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
Additional Information