Worleyyy closed this issue 3 weeks ago
Looks like the memory issue occurs in a join call, which traditionally leads to an explosion in memory, regardless of which programming language you're using.
We are tracking a couple of places within the indexing pipeline where these types of memory issues occur and are refactoring parts of the code to improve the situation.
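For anyone wondering why a join in particular blows up: a minimal pandas sketch (illustrative only, not GraphRAG's code) showing that a many-to-many merge produces m × n output rows for every key that appears m times on one side and n times on the other. With millions of duplicated keys, the intermediate result can dwarf both inputs.

```python
import pandas as pd

# Hypothetical example: the key "a" appears 3 times on the left
# and 4 times on the right, so the merge emits 3 * 4 = 12 rows
# for that key alone.
left = pd.DataFrame({"key": ["a"] * 3, "x": range(3)})
right = pd.DataFrame({"key": ["a"] * 4, "y": range(4)})

merged = left.merge(right, on="key")
print(len(merged))  # 12 rows produced from only 3 + 4 input rows
```

Scaled up, 29k input files can easily yield tens of millions of such rows, which matches the 32-million-row array in the error below.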
I faced this issue at the create final entities and create final communities stages. I had around 29k pages of data, and the total size of the input .txt files was just 70 MB, yet it took around 42 to 50 GB of RAM to complete the whole indexing pipeline. For now I was able to get past this by increasing the RAM from 32 to 128 GB, but the data can be far larger than what I had, and then it might not be possible to increase the RAM further.
This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.
This issue has been closed after being marked as stale for five days. Please reopen if needed.
Do you need to file an issue?
Describe the bug
The pipeline terminates with an "unable to allocate memory" error in the indexing-engine logs.
Data size: 69 MB; number of .txt files: 29k
Steps to reproduce
No response
Expected Behavior
When I executed GraphRAG with the same config for 5,000 .txt files it worked smoothly; I also ran queries and got the expected answers. But when the number of .txt files is larger, around 29k with a total size of 69 MB, the pipeline stops during final community creation with an "unable to allocate 3.1 GiB" error.
First I got a malloc/realloc error for 2.2 GB at the create summarize entities step, so I increased the RAM from 16 GB to 32 GB. The create summarize entities step then completed without errors, but further on it gave the "unable to allocate 3.1 GiB" error at the create final communities step.
GraphRAG Config Used
Logs and screenshots
end part of indexing-engine_logs
10:16:23,745 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_entities.parquet
10:18:49,703 graphrag.index.run.workflow INFO dependencies for create_final_nodes: ['create_base_entity_graph']
10:18:49,734 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
10:19:08,911 datashaper.workflow.workflow INFO executing verb layout_graph
10:43:07,639 datashaper.workflow.workflow INFO executing verb unpack_graph
10:50:55,233 datashaper.workflow.workflow INFO executing verb unpack_graph
10:58:54,440 datashaper.workflow.workflow INFO executing verb drop
10:58:55,85 datashaper.workflow.workflow INFO executing verb filter
10:59:26,549 datashaper.workflow.workflow INFO executing verb select
10:59:26,593 datashaper.workflow.workflow INFO executing verb rename
10:59:26,627 datashaper.workflow.workflow INFO executing verb convert
10:59:26,947 datashaper.workflow.workflow INFO executing verb join
10:59:36,546 datashaper.workflow.workflow INFO executing verb rename
10:59:39,285 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_nodes.parquet
11:00:15,242 graphrag.index.run.workflow INFO dependencies for create_final_communities: ['create_base_entity_graph']
11:00:15,242 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
11:00:35,614 datashaper.workflow.workflow INFO executing verb unpack_graph
11:08:01,576 datashaper.workflow.workflow INFO executing verb unpack_graph
11:16:02,458 datashaper.workflow.workflow INFO executing verb aggregate_override
11:16:03,833 datashaper.workflow.workflow INFO executing verb join
11:20:15,591 datashaper.workflow.workflow INFO executing verb join
11:22:18,181 datashaper.workflow.workflow ERROR Error executing verb "join" in create_final_communities: Unable to allocate 3.14 GiB for an array with shape (13, 32417216) and data type object
Traceback (most recent call last):
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\engine\verbs\join.py", line 83, in join
    return create_verb_result(clean_result(join_strategy, output, input_table))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\engine\verbs\join.py", line 41, in clean_result
    result[result["_merge"] == "both"],
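The failing allocation is a 13-column, 32-million-row object-dtype array, i.e. pandas materializing Python-string columns during the merge. A generic mitigation sketch (not GraphRAG's code, and no substitute for the refactoring the maintainers mention): converting highly repetitive string columns to the `category` dtype before joining can shrink those intermediates considerably, since each distinct string is then stored once.

```python
import pandas as pd

# Hypothetical data shaped like a nodes/communities table: object-dtype
# string columns with very few distinct values repeated many times.
df = pd.DataFrame({
    "community": ["c1", "c2"] * 50_000,   # 100k rows, 2 distinct values
    "title": ["node"] * 100_000,          # 100k rows, 1 distinct value
})

dense = df.memory_usage(deep=True).sum()

# Encode repetitive strings as categoricals: values become small integer
# codes pointing into a single table of the distinct strings.
df["community"] = df["community"].astype("category")
df["title"] = df["title"].astype("category")
compact = df.memory_usage(deep=True).sum()

print(compact < dense)  # categorical encoding uses far less memory here
```

Whether this is feasible depends on where in the datashaper pipeline the frames are built; for most users the practical options remain more RAM, a larger OS page/swap file, or smaller input batches.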