microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: datashaper.workflow.workflow ERROR Error executing verb "join" in create_final_communities: Unable to allocate 3.14 GiB for an array with shape (13, 32417216) and data type object #1208

Closed · Worleyyy closed this issue 3 weeks ago

Worleyyy commented 1 month ago

Do you need to file an issue?

Describe the bug

The pipeline terminates, and the indexing-engine log reports an error: unable to allocate memory.

Data size: 69 MB. Number of .txt files: 29k.


Steps to reproduce

No response

Expected Behavior

When I ran GraphRAG with the same config on 5,000 .txt files, indexing completed smoothly; I also ran queries and got the expected answers. But with around 29k .txt files (69 MB total), the pipeline stops during final community creation with an "unable to allocate 3.1 GiB of memory" error.

At first I got a malloc/realloc error for 2.2 GiB at the create_summarized_entities step, so I increased the RAM from 16 GB to 32 GB. That step then completed without errors, but the pipeline later failed with the "unable to allocate 3.1 GiB" error at the create_final_communities step.
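For context, the 3.14 GiB figure in the error message is consistent with a plain NumPy object-dtype allocation of the reported shape, assuming 8-byte object pointers (a rough check, not from the original report):

```python
# Rough size check for the array named in the error message. Object dtype
# stores 8-byte pointers per cell; the Python objects they reference need
# additional memory on top of this.
rows, cols, pointer_bytes = 32_417_216, 13, 8
print(f"{rows * cols * pointer_bytes / 2**30:.2f} GiB")  # ~3.14 GiB
```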


GraphRAG Config Used

# config

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: GRAPH_RAG_API_KEY
  type: azure_openai_chat
  model: gpt-4o
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: https://openai01.openai.azure.com/
  api_version: 2024-05-01-preview
  # organization: <organization_id>
  deployment_name: GPT-4o
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 50
  # max_retry_wait: 10.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  # target: required # or all
  batch_size: 16 # the number of documents to send in a single request
  batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  llm:
    api_key: GRAPH_RAG_API_KEY
    type: azure_openai_embedding
    model: gpt-4o
    api_base: https://openai01.openai.azure.com/
    api_version: 2024-05-01-preview
    # organization: <organization_id>
    deployment_name: text-embedding-ada-002
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    max_retries: 100
    max_retry_wait: 15.0
    sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made

chunks:
  size: 600
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## strategy: fully override the entity extraction strategy.
  ##   type: one of graph_intelligence, graph_intelligence_json and nltk
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

Tail of the indexing-engine log:

10:16:23,745 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_entities.parquet
10:18:49,703 graphrag.index.run.workflow INFO dependencies for create_final_nodes: ['create_base_entity_graph']
10:18:49,734 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
10:19:08,911 datashaper.workflow.workflow INFO executing verb layout_graph
10:43:07,639 datashaper.workflow.workflow INFO executing verb unpack_graph
10:50:55,233 datashaper.workflow.workflow INFO executing verb unpack_graph
10:58:54,440 datashaper.workflow.workflow INFO executing verb drop
10:58:55,85 datashaper.workflow.workflow INFO executing verb filter
10:59:26,549 datashaper.workflow.workflow INFO executing verb select
10:59:26,593 datashaper.workflow.workflow INFO executing verb rename
10:59:26,627 datashaper.workflow.workflow INFO executing verb convert
10:59:26,947 datashaper.workflow.workflow INFO executing verb join
10:59:36,546 datashaper.workflow.workflow INFO executing verb rename
10:59:39,285 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_nodes.parquet
11:00:15,242 graphrag.index.run.workflow INFO dependencies for create_final_communities: ['create_base_entity_graph']
11:00:15,242 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
11:00:35,614 datashaper.workflow.workflow INFO executing verb unpack_graph
11:08:01,576 datashaper.workflow.workflow INFO executing verb unpack_graph
11:16:02,458 datashaper.workflow.workflow INFO executing verb aggregate_override
11:16:03,833 datashaper.workflow.workflow INFO executing verb join
11:20:15,591 datashaper.workflow.workflow INFO executing verb join
11:22:18,181 datashaper.workflow.workflow ERROR Error executing verb "join" in create_final_communities: Unable to allocate 3.14 GiB for an array with shape (13, 32417216) and data type object
Traceback (most recent call last):
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\engine\verbs\join.py", line 83, in join
    return create_verb_result(__clean_result(join_strategy, output, input_table))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\engine\verbs\join.py", line 41, in __clean_result
    result[result["_merge"] == "both"],
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\frame.py", line 4093, in __getitem__
    return self._getitem_bool_array(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\frame.py", line 4152, in _getitem_bool_array
    return self.copy(deep=None)
           ^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\generic.py", line 6811, in copy
    data = self._mgr.copy(deep=deep)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 604, in copy
    res._consolidate_inplace()
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 1788, in _consolidate_inplace
    self.blocks = _consolidate(self.blocks)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 2269, in _consolidate
    merged_blocks, _ = _merge_blocks(
                       ^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 2294, in _merge_blocks
    new_values = np.vstack([b.values for b in blocks])  # type: ignore[misc]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\numpy\core\shape_base.py", line 289, in vstack
    return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 3.14 GiB for an array with shape (13, 32417216) and data type object
11:22:18,478 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "join" in create_final_communities: Unable to allocate 3.14 GiB for an array with shape (13, 32417216) and data type object details=None
11:22:18,509 graphrag.index.run.run ERROR error running workflow create_final_communities
Traceback (most recent call last):
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\graphrag\index\run\run.py", line 225, in run_pipeline
    result = await _process_workflow(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\graphrag\index\run\workflow.py", line 91, in _process_workflow
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\engine\verbs\join.py", line 83, in join
    return create_verb_result(__clean_result(join_strategy, output, input_table))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\datashaper\engine\verbs\join.py", line 41, in __clean_result
    result[result["_merge"] == "both"],
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\frame.py", line 4093, in __getitem__
    return self._getitem_bool_array(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\frame.py", line 4152, in _getitem_bool_array
    return self.copy(deep=None)
           ^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\generic.py", line 6811, in copy
    data = self._mgr.copy(deep=deep)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 604, in copy
    res._consolidate_inplace()
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 1788, in _consolidate_inplace
    self.blocks = _consolidate(self.blocks)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 2269, in _consolidate
    merged_blocks, _ = _merge_blocks(
                       ^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\pandas\core\internals\managers.py", line 2294, in _merge_blocks
    new_values = np.vstack([b.values for b in blocks])  # type: ignore[misc]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Chaitanya\Downloads\ccus_env\Lib\site-packages\numpy\core\shape_base.py", line 289, in vstack
    return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 3.14 GiB for an array with shape (13, 32417216) and data type object
11:22:18,572 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
11:22:22,760 graphrag.index.cli ERROR Errors occurred during the pipeline run, see logs for more details.

## stats.json

{
    "total_runtime": 11186.068560123444,
    "num_documents": 29608,
    "input_load_time": 0,
    "workflows": {
        "create_base_text_units": {
            "overall": 29.57318639755249,
            "0_orderby": 0.041030168533325195,
            "1_zip": 0.04615473747253418,
            "2_aggregate_override": 0.40440821647644043,
            "3_chunk": 27.0225191116333,
            "4_select": 0.008150577545166016,
            "5_unroll": 0.026068687438964844,
            "6_rename": 0.008663177490234375,
            "7_genid": 1.273604393005371,
            "8_unzip": 0.05586576461791992,
            "9_copy": 0.008645772933959961,
            "10_filter": 0.6770839691162109
        },
        "create_base_extracted_entities": {
            "overall": 382.0234453678131,
            "0_entity_extract": 264.9716601371765,
            "1_merge_graphs": 117.02942085266113
        },
        "create_summarized_entities": {
            "overall": 717.464183807373,
            "0_summarize_descriptions": 717.464183807373
        },
        "create_base_entity_graph": {
            "overall": 1041.526780128479,
            "0_cluster_graph": 1041.444923877716,
            "1_select": 0.010102510452270508
        },
        "create_final_entities": {
            "overall": 6295.426899194717,
            "0_unpack_graph": 416.3113203048706,
            "1_rename": 0.5346510410308838,
            "2_select": 0.501063346862793,
            "3_dedupe": 0.623976469039917,
            "4_rename": 0.07079553604125977,
            "5_filter": 3.379973888397217,
            "6_text_split": 3.0819966793060303,
            "7_drop": 0.08867287635803223,
            "8_merge": 41.783282995224,
            "9_text_embed": 5824.788746356964,
            "10_drop": 0.07913398742675781,
            "11_filter": 4.167661666870117
        },
        "create_final_nodes": {
            "overall": 2430.296574115753,
            "0_layout_graph": 1438.7281467914581,
            "1_unpack_graph": 467.5943694114685,
            "2_unpack_graph": 479.2062978744507,
            "3_drop": 0.6452786922454834,
            "4_filter": 31.464421033859253,
            "5_select": 0.04390978813171387,
            "6_rename": 0.023688316345214844,
            "7_convert": 0.3285641670227051,
            "8_join": 9.59571361541748,
            "9_rename": 2.6495587825775146
        }
    }
}

### Additional Information

- GraphRAG Version: 0.3.4
- Operating System: Windows 10
- Python Version: 3.11.5
- Related Issues:
jgbradley1 commented 1 month ago

Looks like the memory issue occurs in a join call, which traditionally leads to an explosion in memory, regardless of what programming language you’re in.

We are tracking a couple of places within the indexing pipeline where these types of memory issues occur and are refactoring parts of the code to improve the situation.
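As a rough illustration (a minimal pandas sketch, not GraphRAG code): a many-to-many merge materializes one output row per key match, and the indicator-based filtering seen in the traceback then copies the whole result again.

```python
# Minimal sketch of how a join can explode in memory: two 2,000-row inputs
# with heavily repeated keys produce a 2,000,000-row merge result, and the
# boolean-mask filter on "_merge" forces another full copy of that result.
import pandas as pd

left = pd.DataFrame({"key": [1, 2] * 1_000, "a": range(2_000)})
right = pd.DataFrame({"key": [1, 2] * 1_000, "b": range(2_000)})

merged = left.merge(right, on="key", how="left", indicator=True)
print(len(merged))                           # 2_000_000 rows
both = merged[merged["_merge"] == "both"]    # mirrors the failing filter; copies the frame
print(len(both))
```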

Worleyyy commented 1 month ago

I hit this issue at both the create_final_entities stage and the create_final_communities stage. I had around 29k pages of data; the total size of the input .txt files was just 70 MB, yet it took around 42-50 GB of RAM to complete the whole indexing pipeline. For now I was able to work around the issue by increasing the RAM from 32 GB to 128 GB, but the data can be far larger than mine, and then it might not be possible to increase the RAM further.
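For sizing runs like this, a hedged sketch for tracking the indexing process's peak memory (assumes psutil is installed; this helper is not part of GraphRAG):

```python
# Hypothetical monitor (not part of GraphRAG): poll the resident set size of
# the indexing process so you can see how much RAM a run actually needs.
# Usage: python monitor_rss.py <PID of the graphrag indexing process>; Ctrl+C to stop.
import sys
import time

import psutil

proc = psutil.Process(int(sys.argv[1]))
peak = 0
try:
    while True:
        rss = proc.memory_info().rss
        peak = max(peak, rss)
        print(f"rss={rss / 2**30:.2f} GiB  peak={peak / 2**30:.2f} GiB")
        time.sleep(30)
except KeyboardInterrupt:
    print(f"peak observed: {peak / 2**30:.2f} GiB")
```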

github-actions[bot] commented 4 weeks ago

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

github-actions[bot] commented 3 weeks ago

This issue has been closed after being marked as stale for five days. Please reopen if needed.