microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Issue]: The project covariates do not exist in the built files #580

Closed shaoqing404 closed 3 months ago

shaoqing404 commented 3 months ago

Describe the issue

I'm getting an error when running a local search following the official use case; it tells me the covariates file cannot be found. 1. The covariates artifact was not found in the built graph. Can I drop it from the local searcher? 2. How should I rebuild these covariates in my graph?

Steps to reproduce

Just run the pipeline as documented; stats.json from the run shows (note there is no create_final_covariates workflow):

```json
{
  "total_runtime": 1150.067667722702,
  "num_documents": 1,
  "input_load_time": 0,
  "workflows": {
    "create_base_text_units": { "overall": 0.20503902435302734, "0_orderby": 0.002000093460083008, "1_zip": 0.0009996891021728516, "2_aggregate_override": 0.0030007362365722656, "3_chunk": 0.16653060913085938, "4_select": 0.0019986629486083984, "5_unroll": 0.0030002593994140625, "6_rename": 0.0019996166229248047, "7_genid": 0.00850677490234375, "8_unzip": 0.0019998550415039062, "9_copy": 0.003000974655151367, "10_filter": 0.010001420974731445 },
    "create_base_extracted_entities": { "overall": 782.4712069034576, "0_entity_extract": 782.0635554790497, "1_merge_graphs": 0.40564393997192383 },
    "create_summarized_entities": { "overall": 191.44460487365723, "0_summarize_descriptions": 191.44160509109497 },
    "create_base_entity_graph": { "overall": 1.6647131443023682, "0_cluster_graph": 1.6567118167877197, "1_select": 0.0040013790130615234 },
    "create_final_entities": { "overall": 18.11626625061035, "0_unpack_graph": 0.7127723693847656, "1_rename": 0.003999948501586914, "2_select": 0.005509853363037109, "3_dedupe": 0.004999637603759766, "4_rename": 0.00400090217590332, "5_filter": 0.020002126693725586, "6_text_split": 0.023024320602416992, "7_drop": 0.005999088287353516, "8_merge": 0.1520519256591797, "9_text_embed": 17.15132212638855, "10_drop": 0.0059986114501953125, "11_filter": 0.021512985229492188 },
    "create_final_nodes": { "overall": 4.684857368469238, "0_layout_graph": 2.5701215267181396, "1_unpack_graph": 0.9923768043518066, "2_unpack_graph": 0.9953372478485107, "3_filter": 0.04850602149963379, "4_drop": 0.008002042770385742, "5_select": 0.005998134613037109, "6_rename": 0.0070002079010009766, "7_join": 0.015513420104980469, "8_convert": 0.02905893325805664, "9_rename": 0.006943464279174805 },
    "create_final_communities": { "overall": 2.9321365356445312, "0_unpack_graph": 0.9077630043029785, "1_unpack_graph": 1.0635173320770264, "2_aggregate_override": 0.008929252624511719, "3_join": 0.03451657295227051, "4_join": 0.03902149200439453, "5_concat": 0.013000726699829102, "6_filter": 0.711329460144043, "7_aggregate_override": 0.03801727294921875, "8_join": 0.011513948440551758, "9_filter": 0.02650737762451172, "10_fill": 0.008999109268188477, "11_merge": 0.03851604461669922, "12_copy": 0.01099538803100586, "13_select": 0.009002447128295898 },
    "join_text_units_to_entity_ids": { "overall": 0.05650734901428223, "0_select": 0.009998798370361328, "1_unroll": 0.01150822639465332, "2_aggregate_override": 0.026000499725341797 },
    "create_final_relationships": { "overall": 0.990393877029419, "0_unpack_graph": 0.7843437194824219, "1_filter": 0.057015419006347656, "2_rename": 0.01050710678100586, "3_filter": 0.07201361656188965, "4_drop": 0.010999917984008789, "5_compute_edge_combined_degree": 0.013000249862670898, "6_convert": 0.021511316299438477, "7_convert": 0.012001991271972656 },
    "join_text_units_to_relationship_ids": { "overall": 0.06950807571411133, "0_select": 0.01100015640258789, "1_unroll": 0.012508153915405273, "2_aggregate_override": 0.021997690200805664, "3_select": 0.013003349304199219 },
    "create_final_community_reports": { "overall": 144.23982334136963, "0_prepare_community_reports_nodes": 0.05352377891540527, "1_prepare_community_reports_edges": 0.030506372451782227, "2_restore_community_hierarchy": 0.03902792930603027, "3_prepare_community_reports": 0.9143862724304199, "4_create_community_reports": 143.17837524414062, "5_window": 0.013002157211303711 },
    "create_final_text_units": { "overall": 0.09754467010498047, "0_select": 0.012004613876342773, "1_rename": 0.012516975402832031, "2_join": 0.01651144027709961, "3_join": 0.016002178192138672, "4_aggregate_override": 0.014999866485595703, "5_select": 0.013511419296264648 },
    "create_base_documents": { "overall": 0.1560194492340088, "0_unroll": 0.02499985694885254, "1_select": 0.01399993896484375, "2_rename": 0.013506650924682617, "3_join": 0.016003847122192383, "4_aggregate_override": 0.014995098114013672, "5_join": 0.01651144027709961, "6_rename": 0.012999534606933594, "7_convert": 0.029500961303710938 },
    "create_final_documents": { "overall": 0.03250551223754883, "0_rename": 0.018505573272705078 }
  }
}
```

GraphRAG Config Used

```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key:
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: false # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://api.smnet.asia/v1
  api_base: https://api.deepseek.com/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key:
    type: openai_embedding # or azure_openai_embedding
    model: embedding-2
    api_base: https://open.bigmodel.cn/api/paas/v4
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
```
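To see how the config actually parses (a quick sketch using PyYAML; it assumes you run it from the project root next to settings.yaml):

```python
import yaml

# Load the project settings and inspect the claim_extraction section.
with open("settings.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

claims = cfg.get("claim_extraction") or {}
# If "enabled" was left commented out, this prints None and covariates stay off.
print("claim_extraction.enabled =", claims.get("enabled"))
```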

Logs and screenshots

No response

Additional Information

nikhilmaddirala commented 3 months ago

I'm facing the same issue. After indexing, local search is looking for "create_final_covariates" but this does not exist.
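A data-loading pattern that tolerates the missing file (a sketch based on the local search example notebook; `read_indexer_covariates` and the `{"claims": ...}` dict shape come from that notebook and should be double-checked against your graphrag version):

```python
from pathlib import Path
import pandas as pd

INPUT_DIR = "output/<timestamp>/artifacts"  # adjust to your run
cov_path = Path(INPUT_DIR) / "create_final_covariates.parquet"

covariates = None
if cov_path.exists():
    from graphrag.query.indexer_adapters import read_indexer_covariates

    # Convert the claims dataframe into Covariate objects for the context builder.
    covariates = {"claims": read_indexer_covariates(pd.read_parquet(cov_path))}

# Pass `covariates` (possibly None) to the local search context builder
# instead of reading the parquet file unconditionally.
```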

darkfennertrader commented 3 months ago

same here....

IdaWoods commented 3 months ago

There is a 'claim_extraction:' section in 'settings.yaml'. Set 'enabled' to true (i.e., uncomment that line) to generate the file 'create_final_covariates.parquet'.
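Concretely, in settings.yaml (the commented-out `# enabled: true` line is what keeps covariates off):

```yaml
claim_extraction:
  enabled: true # uncomment and set to true to generate create_final_covariates.parquet
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0
```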

natoverse commented 3 months ago

As @IdaWoods notes, you can optionally turn on covariates. We leave them off by default because they tend to take quite a bit of prompt tuning. Search should ignore covariates if they are missing.

We also have a consolidated issue for non-OpenAI/Azure models here: #657. Often these sorts of errors are a red herring due to some malformed response from the model.

nikhilmaddirala commented 3 months ago

> As @IdaWoods notes, you can optionally turn on covariates. We leave them off by default because they tend to take quite a bit of prompt tuning. Search should ignore covariates if they are missing.
>
> We also have a consolidated issue for non-OpenAI/Azure models here: #657. Often these sorts of errors are a red herring due to some malformed response from the model.

I think the solution is to improve the documentation for settings.yaml.

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

My interpretation of this code is that claim_extraction is enabled by default. It would be good to clearly specify the defaults for each setting and what to do with each setting.
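For example, the template could state the default explicitly instead of shipping a commented-out line (suggested wording only; that the default is false is confirmed by the maintainer reply above):

```yaml
claim_extraction:
  enabled: false # default: false; set to true to generate create_final_covariates.parquet
```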

shaoqing404 commented 3 months ago

> As @IdaWoods notes, you can optionally turn on covariates. We leave them off by default because they tend to take quite a bit of prompt tuning. Search should ignore covariates if they are missing. We also have a consolidated issue for non-OpenAI/Azure models here: #657. Often these sorts of errors are a red herring due to some malformed response from the model.
>
> I think the solution is to improve the documentation for settings.yaml. My interpretation of this code is that claim_extraction is enabled by default. It would be good to clearly specify the defaults for each setting and what to do with each setting.

""" 感谢各位的回复。我观察到在中文环境中打开会引起效果的崩溃,因此如果使用中文文档,几乎必须将它关闭。 在中文社区当中已经有开发者准备提交中文分词器来改善这一情况,你觉得它应该作为建议提供给graphrag的开发者吗? """

Thanks for your responses. I've observed that opening in a Chinese environment causes the effect to crash, so if using Chinese documentation, it's almost necessary to close it. There are already developers in the Chinese community who are ready to submit Chinese word segmentation to improve this situation. Do you think it should be offered as a suggestion to graphrag developers?
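For context, "Chinese word segmentation" here means pre-segmenting text before chunking/extraction; a minimal illustration with the jieba library (a hypothetical preprocessing step, not an existing graphrag integration point):

```python
import jieba  # widely used Chinese word segmentation library

text = "图谱检索增强生成在中文语料上的效果"
# Insert spaces at word boundaries so downstream token-based chunking
# aligns with Chinese word units rather than raw characters.
segmented = " ".join(jieba.cut(text))
print(segmented)
```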