microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Issue]: create_final_community_reports error because of a JSON loading error. #804

Closed galen1980guo closed 3 months ago

galen1980guo commented 4 months ago

Is there an existing issue for this?

Describe the issue

I deployed the LLM with Ollama and the embeddings with Xinference. Building the index fails at the create_final_community_reports step. According to the detailed logs, the failure is caused by an error while parsing the JSON returned by the model.

Steps to reproduce

No response

GraphRAG Config Used


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama # ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: llama3:70b
  model_supports_json: false # recommended if this is available for your model.
  api_base: http://10.110.0.25:11434/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: xinference
    type: openai_embedding # or azure_openai_embedding
    model: bce-embedding-base_v1
    api_base: http://10.110.0.25:9997/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots


$ poetry run poe index --root .
Poe => python -m graphrag.index --root .
🚀 Reading settings from settings.yaml
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_base_text_units
                                  id                                              chunk                          chunk_id                        document_ids  n_tokens
0   e4f90b970ba24f1c441698ec39252396  上海近 40 度的高温,并没有阻止人们参会的热情——相反,7 月 4 日于上海举办的 202...  e4f90b970ba24f1c441698ec39252396  [55274aaec21254cba99632471c0033d7]     
300
1   05c2ac7be39bcf579b489d309ba23ebb  把手、还是展台前被 AI 裹挟的普通人,又都在说同一件事——AI 应用如何落地。\n这里没有...  05c2ac7be39bcf579b489d309ba23ebb  [55274aaec21254cba99632471c0033d7] 
300
2   3380e68249b1a8d8752f190053897e48  。大会的议程设置往往反映了行业的普遍趋势。除了大模型之外,具身智能、机器人、芯片等领域也延续...  3380e68249b1a8d8752f190053897e48  
[55274aaec21254cba99632471c0033d7]       300
3   236037d63b053cc5561d77c3cf97d89a  的投入等方面来看,我们对 AI安全的投入远远落后于对 AI 性能的投入。现在,世界上只有 1...  236037d63b053cc5561d77c3cf97d89a  
[55274aaec21254cba99632471c0033d7]       300
4   89d5196a8d134db82c434ca2f8e0022d  风险,比如说 AI非常强大,而且是可以有很多方式去使用,所以颠覆现在社会结构在短时间内发生的...  89d5196a8d134db82c434ca2f8e0022d  
[55274aaec21254cba99632471c0033d7]       300
5   7fcbb470dd11fba1f05cea93c0b4a025  给破坏了,这样权衡是非常困难的。正如图灵所说,这是无法预测的,预测不了机器有了足够算力之后会...  7fcbb470dd11fba1f05cea93c0b4a025  
[55274aaec21254cba99632471c0033d7]       300
6   34bfc36eaafebee552620a9da71711cc  的增益,整体价值就比移动互联网要大多了。\n智能体是最看好的 AI 应用方向,搜索是智能体分...  34bfc36eaafebee552620a9da71711cc  
[55274aaec21254cba99632471c0033d7]       300
7   45073e405f1ae1b24b50c25e1f6dab81  专有的问题。\n制作一个好的智能体通常并不需要编码,只要用人话把智能体的工作流说清楚,再配上...  45073e405f1ae1b24b50c25e1f6dab81  
[55274aaec21254cba99632471c0033d7]       300
8   43a82f160b03cd9edb694538c7404bfc  ��烈竞争的环境中,需要让业务效率比同行更高、成本比同行更低时,商业化的闭源模型是最能打的。...  43a82f160b03cd9edb694538c7404bfc  
[55274aaec21254cba99632471c0033d7]       300
9   ffdd22363ae3c156e72bcfe81764f2cc  不能够让你站在巨人的肩膀上去迭代和开发。\n\n傅盛- 猎豹移动董事长兼 CEO,\n猎户星...  ffdd22363ae3c156e72bcfe81764f2cc  [55274aaec21254cba99632471c0033d7]  
300
10  a98c26291b2387897b27b505d502fccb  工智能,我们叫智能涌现,其实对中间的原理并不是特别清楚,是一个灰盒状态。\n智能的涌现,可能...  a98c26291b2387897b27b505d502fccb  
[55274aaec21254cba99632471c0033d7]       300
11  7547fc209a7f053b453c16f9ce175bb0  人们发展的是什么?我们和 AI 不一样的地方在哪里呢?我有一个想法,那就是好奇心。\n我还想...  7547fc209a7f053b453c16f9ce175bb0  
[55274aaec21254cba99632471c0033d7]       300
12  43cdc28708e6b851c748d5a2bebcb2f5  初,发明了人工智能这个词的十个人之一——赫本山姆(Herbert A. Simon)跟我们讲...  43cdc28708e6b851c748d5a2bebcb2f5  [55274aaec21254cba99632471c0033d7]       
300
13  c0cff22a90d6fe2baad2f444c1152d19  ,我个人觉得有一点点混淆,翻译成普通人工智能会更加确切,它是一个最最基本的东西,而不是从通用...  c0cff22a90d6fe2baad2f444c1152d19  
[55274aaec21254cba99632471c0033d7]       300
14  d26a230cf29673de6aa58a557adbcbdc  企业也要意识到这是革命的工具,那这个变化就来了。\n\n张平安- 华为常务董事、华为云 CE...  d26a230cf29673de6aa58a557adbcbdc  
[55274aaec21254cba99632471c0033d7]       300
15  b1aa1aa2280607243e7d5c79fdaf6d5b  的这张图片里)蚂蚁绒毛清晰可见。\n在云端,通过云网端芯架构上的协同创新,来构建可持续发展的...  b1aa1aa2280607243e7d5c79fdaf6d5b  
[55274aaec21254cba99632471c0033d7]       300
16  c6b307e91937f72eb51fe1ec65d38108  杂决策难以胜任,以及对话交互不等于有效协同。\n通过专业智能体的深度连接,AI 会像互联网一...  c6b307e91937f72eb51fe1ec65d38108  
[55274aaec21254cba99632471c0033d7]       300
17  da8eab81e95c61cf1d5dd7137809c90d  业深度协作,需要很多的专业智能体共同参与、各司其职。蚂蚁坚持走开放道路,和行业共建专业智能体...  da8eab81e95c61cf1d5dd7137809c90d  
[55274aaec21254cba99632471c0033d7]       300
18  aed528e5d60172692c78a9d02ebfbdd6  ,我忽然感觉有点变化的想法。因为我的中学的退休的老师不停的在群里面问我,怎么样用人工智能去写...  aed528e5d60172692c78a9d02ebfbdd6  
[55274aaec21254cba99632471c0033d7]       300
19  e7d446b66b5ba7e0567f0032005cb96b  点:高质量数据、流畅的交互、可控性\n如果要推动人工智能超级时刻的到来,需要大模型可以展现出...  e7d446b66b5ba7e0567f0032005cb96b  
[55274aaec21254cba99632471c0033d7]       300
20  4e230251f89ebaeb16a495c1f737bc01  全自然的交互模式。\n第三,所有的生成都要可控,你不需要做得很好,但你需要知道你哪里做得不好...  4e230251f89ebaeb16a495c1f737bc01  
[55274aaec21254cba99632471c0033d7]       300
21  54455c58f6a327d5c8e24d1bbdcdcda6  作用,但如果将 20% 的生成式 AI 工作负载转移到终端侧,预计到 2028 年将节省 1...  54455c58f6a327d5c8e24d1bbdcdcda6  [55274aaec21254cba99632471c0033d7]       
300
22  bc6487902d0d4f6f2ed0db220e215a3a  使其体量越来越小,效率越来越高。\nIDC 预测,预计 2027 年中国新一代 AI 手机出...  bc6487902d0d4f6f2ed0db220e215a3a  [55274aaec21254cba99632471c0033d7]      
300
23  ffd76bd503d085b6c431e04247900683  有 30%、40% 的错误率。国内的模型整体有 60% 到 70% 的错误率。\n为什么大模...  ffd76bd503d085b6c431e04247900683  [55274aaec21254cba99632471c0033d7]       300
24  70927056007aaf7684bf4e3139965f64  会价值是至关重要的。\n提升模型正确率的关键路径\n比如为什么我们要做合成数据?比如为什么我...  70927056007aaf7684bf4e3139965f64  
[55274aaec21254cba99632471c0033d7]       300
25  eeed1b1a6ec8ed0b9df92a77b91410bd  �得大模型的价格持续走低整体来说,是一个非常正向的事。因为它本来就应该降低。同时它降低的同时...  eeed1b1a6ec8ed0b9df92a77b91410bd  
[55274aaec21254cba99632471c0033d7]       300
26  4bf095eb3a669178cd032d4eac534df4  什么要多模态?是因为真正的人在现实世界中解决问题的时候,他需要的、输入的信息本身就是多模态的...  4bf095eb3a669178cd032d4eac534df4  
[55274aaec21254cba99632471c0033d7]       300
27  5846e8d52f729f53892990e96d654be7  供更优质的服务,大家能够用这个服务创造更大的价值,然后我们创造这一部分价值应该反向再传递回来...  5846e8d52f729f53892990e96d654be7  
[55274aaec21254cba99632471c0033d7]       300
28  410250a724b1650a30b0bc8c80dc44c1  知时代的 AI,能够产生实际的效能,但是它是受限的,泛用性不够、成本太高、需要垂直化去做很多...  410250a724b1650a30b0bc8c80dc44c1  
[55274aaec21254cba99632471c0033d7]       300
29  58513fa20127169dc1e0d74a91a5804e  供泛用化的能力,解决一系列的场景和应用需求,从而来解决成本和收益平衡的问题,这是它本质的特点...  58513fa20127169dc1e0d74a91a5804e  
[55274aaec21254cba99632471c0033d7]       142
⠸ GraphRAG Indexer 
🚀 create_base_extracted_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_base_entity_graph
   level                                    clustered_graph
0      0  <graphml xmlns="http://graphml.graphdrawing.or...
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_final_entities
                                   id                           name  ...                                      text_unit_ids                              description_embedding
0    b45241d70f0e43fca764df95b2b81f77                           "上海"  ...  [7547fc209a7f053b453c16f9ce175bb0, a98c26291b2...  [-0.008768483996391296, 0.0427890345454216, 0....
1    4119fd06010c494caa07f439b333f4c5  "2024年世界人工智能大会暨人工智能全球治理高级别会议"  ...                   [-0.008538035675883293, 0.01028020866215229, 0...
2    d3835bf3dda84ead99deadbeac5d0d7d                        "图灵奖得主"  ...                   [-0.009510371834039688, 0.04583802819252014, 0...
3    077d2820ae1845bcbb1803379a3d1eae                      "科技公司一把手"  ...                   [-0.0219394713640213, 0.04319910705089569, -0....
4    3671ea0dd4e84c1a9b02c5ab2c8f4bac                        "AI 应用"  ...                   [-0.0041207666508853436, 0.03910283371806145, ...
..                                ...                            ...  ...                                                ...                                                ...
127  aff21f1da1654e7babdcf3fb0e4a75fc                           "价格"  ...                 [5846e8d52f729f53892990e96d654be7]  [0.01759442873299122, 0.009148363955318928, -0...
128  dc2cc9016e3f49dbac7232f05cce794d                           "机器"  ...                 [5846e8d52f729f53892990e96d654be7]  [-0.00537463091313839, -0.007706150878220797, ...
129  6ea0cef05f694dcea455478f40674e45                         "实体经济"  ...  [410250a724b1650a30b0bc8c80dc44c1, 58513fa2012...  [0.014552868902683258, 0.01658778265118599, -0...
130  7ab5d53a872f4dfc98f3d386879f3c75                        "让机器思考"  ...                 [410250a724b1650a30b0bc8c80dc44c1]  [0.034711189568042755, 0.01091317180544138, -0...
131  af1d0fec22114a3398b8016f5225f9ed                    "新一代生成式 AI"  ...                 [58513fa20127169dc1e0d74a91a5804e]  [0.016883114352822304, 0.006847569718956947, -...

[132 rows x 8 columns]
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use 
to_datetime without passing `errors` and catch exceptions explicitly instead
  datetime_column = pd.to_datetime(column, errors="ignore")
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: UserWarning: Could not infer format, so each element will be parsed individually, 
falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  datetime_column = pd.to_datetime(column, errors="ignore")
🚀 create_final_nodes
     level                          title            type                                        description  ... entity_type                 top_level_node_id  x  y
0        0                           "上海"           "GEO"  Here is the comprehensive summary:\n\n"上海 (Sha...  ...         NaN  b45241d70f0e43fca764df95b2b81f77  0  0
1        0  "2024年世界人工智能大会暨人工智能全球治理高级别会议"         "EVENT"  "The conference is a significant event that br...  ...         NaN  4119fd06010c494caa07f439b333f4c5  0  0
2        0                        "图灵奖得主"        "PERSON"  "Turing Award winners are among the attendees ...  ...         NaN  d3835bf3dda84ead99deadbeac5d0d7d  0  0
3        0                      "科技公司一把手"        "PERSON"  "Tech company executives are also present at t...  ...         NaN  077d2820ae1845bcbb1803379a3d1eae  0  0
4        0                        "AI 应用"                                                                     ...         NaN  3671ea0dd4e84c1a9b02c5ab2c8f4bac  0  0
..     ...                            ...             ...                                                ...  ...         ...                               ... .. ..
127      0                           "价格"         "PRICE"  "价格 refers to the amount of money paid for a g...  ...         NaN  aff21f1da1654e7babdcf3fb0e4a75fc  0  0
128      0                           "机器"       "MACHINE"  "机器 refers to devices that can perform tasks, ...  ...         NaN  dc2cc9016e3f49dbac7232f05cce794d  0  0
129      0                         "实体经济"  "ORGANIZATION"  Here is a comprehensive summary of the data:\n...  ...         NaN  6ea0cef05f694dcea455478f40674e45  0  0
130      0                        "让机器思考"                                                                     ...         NaN  7ab5d53a872f4dfc98f3d386879f3c75  0  0
131      0                    "新一代生成式 AI"    "TECHNOLOGY"  "New generation generative AI is a technology ...  ...         NaN  af1d0fec22114a3398b8016f5225f9ed  0  0

[132 rows x 15 columns]
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_final_communities
  id        title  level raw_community                                   relationship_ids                                      text_unit_ids
0  0  Community 0      0             0  [a671bf7fea2f4514b6e96ba99127fafd, 525f41ea202...  [05c2ac7be39bcf579b489d309ba23ebb,236037d63b05...
1  1  Community 1      0             1  [7ce637e4f35b42e3a9f8272cab69cd22, 4d999d7744b...  [05c2ac7be39bcf579b489d309ba23ebb,410250a724b1...
2  2  Community 2      0             2  [3deb220d31f74103aa44870a36a63220, 525f41ea202...  [7fcbb470dd11fba1f05cea93c0b4a025,89d5196a8d13...
3  4  Community 4      0             4  [af7a1584dd15492cb9a4940e285f57fc, 6e8d9029ce4...  [34bfc36eaafebee552620a9da71711cc,45073e405f1a...
4  3  Community 3      0             3  [6731a665561840c2898ce8c9788e4c88, 4026806fa92...                 
🚀 join_text_units_to_entity_ids
                       text_unit_ids                                         entity_ids                                id
0   7547fc209a7f053b453c16f9ce175bb0  [b45241d70f0e43fca764df95b2b81f77, d54956b79dd...  7547fc209a7f053b453c16f9ce175bb0
1   a98c26291b2387897b27b505d502fccb  [b45241d70f0e43fca764df95b2b81f77, ef32c4b208d...  a98c26291b2387897b27b505d502fccb
2   e4f90b970ba24f1c441698ec39252396  [b45241d70f0e43fca764df95b2b81f77, 4119fd06010...  e4f90b970ba24f1c441698ec39252396
3   05c2ac7be39bcf579b489d309ba23ebb  [19a7f254a5d64566ab5cc15472df02de, e7ffaee9d31...  05c2ac7be39bcf579b489d309ba23ebb
4   236037d63b053cc5561d77c3cf97d89a  [19a7f254a5d64566ab5cc15472df02de, 254770028d7...  236037d63b053cc5561d77c3cf97d89a
5   34bfc36eaafebee552620a9da71711cc  [19a7f254a5d64566ab5cc15472df02de, dde131ab575...  34bfc36eaafebee552620a9da71711cc
6   410250a724b1650a30b0bc8c80dc44c1  [19a7f254a5d64566ab5cc15472df02de, e1fd0e904a5...  410250a724b1650a30b0bc8c80dc44c1
7   4e230251f89ebaeb16a495c1f737bc01  [19a7f254a5d64566ab5cc15472df02de, 7e2c84548fb...  4e230251f89ebaeb16a495c1f737bc01
8   5846e8d52f729f53892990e96d654be7  [19a7f254a5d64566ab5cc15472df02de, e1fd0e904a5...  5846e8d52f729f53892990e96d654be7
9   89d5196a8d134db82c434ca2f8e0022d  [19a7f254a5d64566ab5cc15472df02de, 68105770b52...  89d5196a8d134db82c434ca2f8e0022d
10  b1aa1aa2280607243e7d5c79fdaf6d5b  [19a7f254a5d64566ab5cc15472df02de, c03ab3ce8cb...  b1aa1aa2280607243e7d5c79fdaf6d5b
11  c6b307e91937f72eb51fe1ec65d38108  [19a7f254a5d64566ab5cc15472df02de, c6d1e4f56c2...  c6b307e91937f72eb51fe1ec65d38108
12  ffd76bd503d085b6c431e04247900683  [19a7f254a5d64566ab5cc15472df02de, 7f65feab754...  ffd76bd503d085b6c431e04247900683
13  70927056007aaf7684bf4e3139965f64  [e7ffaee9d31d4d3c96e04f911d0a8f9e, fd9cb733b28...  70927056007aaf7684bf4e3139965f64
14  58513fa20127169dc1e0d74a91a5804e  [e1fd0e904a53409aada44442f23a51cb, 6ea0cef05f6...  58513fa20127169dc1e0d74a91a5804e
15  aed528e5d60172692c78a9d02ebfbdd6  [e1fd0e904a53409aada44442f23a51cb, 32e6ccab20d...  aed528e5d60172692c78a9d02ebfbdd6
16  e7d446b66b5ba7e0567f0032005cb96b  [e1fd0e904a53409aada44442f23a51cb, 7cc3356d38d...  e7d446b66b5ba7e0567f0032005cb96b
17  3380e68249b1a8d8752f190053897e48  [bc0e3f075a4c4ebbb7c7b152b65a5625, 254770028d7...  3380e68249b1a8d8752f190053897e48
18  7fcbb470dd11fba1f05cea93c0b4a025  [68105770b523412388424d984e711917, 85c79fd84f5...  7fcbb470dd11fba1f05cea93c0b4a025
19  45073e405f1ae1b24b50c25e1f6dab81  [dde131ab575d44dbb55289a6972be18f, 32ee140946e...  45073e405f1ae1b24b50c25e1f6dab81
20  43a82f160b03cd9edb694538c7404bfc  [de6fa24480894518ab3cbcb66f739266, 6fae5ee1a83...  43a82f160b03cd9edb694538c7404bfc
21  ffdd22363ae3c156e72bcfe81764f2cc  [de6fa24480894518ab3cbcb66f739266, 6fae5ee1a83...  ffdd22363ae3c156e72bcfe81764f2cc
22  43cdc28708e6b851c748d5a2bebcb2f5  [94a964c6992945ebb3833dfdfdc8d655, 1eb829d0ace...  43cdc28708e6b851c748d5a2bebcb2f5
23  c0cff22a90d6fe2baad2f444c1152d19  [26f88ab3e2e04c33a459ad6270ade565, babe97e1d97...  c0cff22a90d6fe2baad2f444c1152d19
24  d26a230cf29673de6aa58a557adbcbdc  [26f88ab3e2e04c33a459ad6270ade565, c9b8ce91fc2...  d26a230cf29673de6aa58a557adbcbdc
25  bc6487902d0d4f6f2ed0db220e215a3a  [1033a18c45aa4584b2aef6ab96890351, 91ff849d12b...  bc6487902d0d4f6f2ed0db220e215a3a
26  54455c58f6a327d5c8e24d1bbdcdcda6  [53af055f068244d0ac861b2e89376495, 7ea4afbf8a2...  54455c58f6a327d5c8e24d1bbdcdcda6
27  da8eab81e95c61cf1d5dd7137809c90d  [c6d1e4f56c2843e89cf0b91c10bb6de2, 0adb2d9941f...  da8eab81e95c61cf1d5dd7137809c90d
28  eeed1b1a6ec8ed0b9df92a77b91410bd  [de837ff3d626451282ff6ac77a82216d, 460295fed3a...  eeed1b1a6ec8ed0b9df92a77b91410bd
29  4bf095eb3a669178cd032d4eac534df4  [6ea81acaf232485e94fff638e03336e1, d136b08d586...  4bf095eb3a669178cd032d4eac534df4
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. 
Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:65: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use 
to_numeric without passing `errors` and catch exceptions explicitly instead
  column_numeric = cast(pd.Series, pd.to_numeric(column, errors="ignore"))
🚀 create_final_relationships
            source                         target  weight                                        description  ... human_readable_id source_degree target_degree  rank
0             "上海"  "2024年世界人工智能大会暨人工智能全球治理高级别会议"     1.0  "The conference was held in Shanghai, attracti...  ...                 0             2             1     3
1             "上海"                         "爱乐乐团"     2.0  Here is the comprehensive summary:\n\n"\u7231\...  ...                 1             2             1     3
2          "图灵奖得主"                        "AI 应用"     1.0  "Turing Award winners are discussing the pract...  ...                 2             1             2     3
3        "科技公司一把手"                        "AI 应用"     1.0  "Tech company executives are sharing their exp...  ...                 3             1             2     3
4             "AI"                      "AI 行业大咖"     1.0  "AI industry leaders are discussing the develo...  ...                 4            10             1    11
..             ...                            ...     ...                                                ...  ...               ...           ...           ...   ...
65            "模型"                        "大模型企业"     1.0  "大模型企业 is related to the 模型, potentially using...  ...                65             2             2     4
66         "大模型企业"                         "价格降低"     1.0  "The 事件 of prices decreasing benefits 大模型企业 by...  ...                66             2             1     3
67  "LARGE MODELS"                        "智譜 AI"     1.0  "智譜 AI is a company focused on developing and ...  ...                67             1             1     2
68            "技术"                           "降价"     1.0  "The event of 降价 is driven by the advancement ...  ...                68             1             1     2
69          "实体经济"                        "让机器思考"     1.0  "实体经济 is influenced by the direction of develo...  ...                69             2             1     3

[70 rows x 10 columns]
🚀 join_text_units_to_relationship_ids
                                  id                                   relationship_ids
0   e4f90b970ba24f1c441698ec39252396  [b07a7f088364459098cd8511ff27a4c8, cd130938a28...
1   7547fc209a7f053b453c16f9ce175bb0  [8870cf2b5df64d2cab5820f67e29b9f1, b1f6164116d...
2   a98c26291b2387897b27b505d502fccb  [8870cf2b5df64d2cab5820f67e29b9f1, 896d2a51e8d...
3   05c2ac7be39bcf579b489d309ba23ebb                 
4   89d5196a8d134db82c434ca2f8e0022d  [525f41ea20274a05af4e52b625b473f3, 071a416efbe...
5   34bfc36eaafebee552620a9da71711cc  [6d8473ef3b1042bf87178a611e3dbcc6, af7a1584dd1...
6   b1aa1aa2280607243e7d5c79fdaf6d5b  [30c9641543c24773938bd8ec57ea98ab, 6731a665561...
7   c6b307e91937f72eb51fe1ec65d38108  [18b839da898e4026b81727d759d95c6a, 68e0c60d2e8...
8   4e230251f89ebaeb16a495c1f737bc01  [eeef6ae5c464400c8755900b4f1ac37a, 422433aa458...
9   410250a724b1650a30b0bc8c80dc44c1  [86505bca739d4bccaaa1a8e0f3baffdc, 9a6f414210e...
10  5846e8d52f729f53892990e96d654be7  [86505bca739d4bccaaa1a8e0f3baffdc, 1af9faf341e...
11  70927056007aaf7684bf4e3139965f64  [353d91abc68648639d65a549e59b5cf3, 4465efb7f6e...
12  aed528e5d60172692c78a9d02ebfbdd6  [7ce637e4f35b42e3a9f8272cab69cd22, 735d19aea07...
13  e7d446b66b5ba7e0567f0032005cb96b  [4d999d7744b04a998475f8f8531589f0, a0047221896...
14  3380e68249b1a8d8752f190053897e48  [db541b7260974db8bac94e953009f60e, f2ff8044718...
15  236037d63b053cc5561d77c3cf97d89a  [87915637da3e474c9349bd0ae604bd95, 8f1eba29f39...
16  7fcbb470dd11fba1f05cea93c0b4a025                 [3deb220d31f74103aa44870a36a63220]
17  45073e405f1ae1b24b50c25e1f6dab81  [6e8d9029ce4e4ea182367173ab2c7bbf, 0e8d921ccd8...
18  43a82f160b03cd9edb694538c7404bfc  [4f2c665decf242b0bfcaf7350b0e02ed, 66cdf168f36...
19  ffdd22363ae3c156e72bcfe81764f2cc  [4f2c665decf242b0bfcaf7350b0e02ed, 66cdf168f36...
20  43cdc28708e6b851c748d5a2bebcb2f5  [5a28b94bc63b44edb30c54748fd14f15, f97011b2a99...
21  c0cff22a90d6fe2baad2f444c1152d19  [35489ca6a63b47d6a8913cf333818bc1, 6fb57f83bae...
22  d26a230cf29673de6aa58a557adbcbdc  [5d3344f45e654d2c808481672f2f08dd, 70634e10a5e...
23  54455c58f6a327d5c8e24d1bbdcdcda6  [d203efdbfb2f4b2a899abfb31cf72e82, 31a7e680c4d...
24  da8eab81e95c61cf1d5dd7137809c90d  [68e0c60d2e8845d89d9d0ad397833648, 60c58026b27...
25  bc6487902d0d4f6f2ed0db220e215a3a  [351abba16e5c448994c6daf48121b14d, 50ea7d3b696...
26  eeed1b1a6ec8ed0b9df92a77b91410bd                 
27  4bf095eb3a669178cd032d4eac534df4                 [5dabc4cd05da425cb194a04482bf0c29]
❌ create_final_community_reports
None
⠸ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
└── create_final_community_reports
❌ Errors occurred during the pipeline run, see logs for more details.

...

The detailed log shows:
13:23:36,521 graphrag.index.run INFO dependencies for create_final_community_reports: ['create_final_nodes', 'create_final_relationships']
13:23:36,522 graphrag.index.run INFO read table from storage: create_final_nodes.parquet
13:23:36,527 graphrag.index.run INFO read table from storage: create_final_relationships.parquet
13:23:36,570 datashaper.workflow.workflow INFO executing verb prepare_community_reports_nodes
13:23:36,589 datashaper.workflow.workflow INFO executing verb prepare_community_reports_edges
13:23:36,607 datashaper.workflow.workflow INFO executing verb restore_community_hierarchy
13:23:36,628 datashaper.workflow.workflow INFO executing verb prepare_community_reports
13:23:36,628 graphrag.index.verbs.graph.report.prepare_community_reports INFO Number of nodes at level=0 => 132
13:23:36,668 datashaper.workflow.workflow INFO executing verb create_community_reports
13:23:58,336 httpx INFO HTTP Request: POST http://10.110.0.25:11434/v1/chat/completions "HTTP/1.1 200 OK"
13:23:58,339 graphrag.llm.openai.utils ERROR error loading json, json=Here is the output in JSON format:```{    "title": "Baidu Community",    "summary": "The Baidu community revolves around Robin Li, the founder and CEO of Baidu, a Chinese technology company focused on AI applications and research. The community's dynamics are shaped by Robin Li's concerns about AI development risks and his leadership role in Baidu.",    "rating": 6.0,    "rating_explanation": "The impact severity rating is moderate due to the potential influence of Robin Li's concerns about AI development risks on the tech industry.",    "findings": [        {            "summary": "Robin Li's leadership role in Baidu",            "explanation": "Robin Li is the founder, chairman, and CEO of Baidu, a Chinese technology company. This leadership role suggests his significant influence on the company's direction and decisions [Data: Entities (31); Relationships (24)]."        },        {            "summary": "Baidu's focus on AI applications and research",            "explanation": "Baidu is a Chinese technology company focused on both AI applications and research. This focus suggests the company's potential impact on the tech industry [Data: Entities (32)]."        },        {            "summary": "Robin Li's concerns about AI development risks",            "explanation": "Robin Li is concerned about the risks associated with Artificial Intelligence (AI) development. This concern could influence his leadership decisions in Baidu and shape the company's approach to AI research [Data: Entities (31); Relationships (5)]."        }    ]}```Let me know if you need any further assistance!
Traceback (most recent call last):
  File "/data/galen_guo/workspace/LLM-Research/graphrag/graphrag/llm/openai/utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
             ^^^^^^^^^^^^^^^^^
  File "/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
13:23:58,340 graphrag.llm.openai.openai_chat_llm WARNING error parsing llm json, retrying
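
The traceback reports `Expecting value: line 1 column 1 (char 0)` because the model reply starts with prose rather than JSON. A minimal sketch of that failure, using a shortened stand-in for the logged reply (the `raw` string and the brace-slicing workaround are illustrative assumptions, not GraphRAG code):

```python
import json

# Three backticks, built programmatically so this example stays easy to embed.
FENCE = "`" * 3

# Shortened stand-in for the reply in the log above (content abbreviated).
raw = (
    "Here is the output in JSON format:"
    + FENCE
    + '{"title": "Baidu Community", "rating": 6.0}'
    + FENCE
    + "Let me know if you need any further assistance!"
)

error_position = None
error_message = None
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    # The decoder gives up at the first character, which is prose, not JSON.
    error_position = err.pos
    error_message = err.msg

# Slicing from the first "{" to the last "}" leaves only the JSON object.
start, end = raw.find("{"), raw.rfind("}")
report = json.loads(raw[start : end + 1])
```

The slicing trick only works when the reply contains a single top-level object and no stray braces in the surrounding prose.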

### Additional Information

- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:
galen1980guo commented 4 months ago

Has anyone encountered the same issue, or does anyone know how to solve this problem? Thanks.

natoverse commented 4 months ago

This is likely resolved with https://github.com/microsoft/graphrag/pull/801. We will push a release shortly, or you can update on main if you are running the codebase directly. Otherwise, there is lots of commentary linked to #657 about the JSON formats of non-OpenAI models.

xxll88 commented 3 months ago

> This is likely resolved with #801. We will push a release shortly, or you can update on main if you are running the codebase directly. Otherwise, there is lots of commentary linked to #657 about the JSON formats of non-OpenAI models

Although json_clean_up repairs faulty JSON responses, it increases index time and global search time by roughly 70-80%. Index time: 35 min on v0.2.0, 62 min on v0.2.1. Global search time: 45 s on v0.2.0, 75 s on v0.2.1.
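
For reference, the relative slowdown implied by those timings can be computed directly. This is a small sketch that uses only the numbers quoted above; the variable and function names are illustrative:

```python
# Timings as reported in the comment (minutes for indexing, seconds for search).
index_minutes = {"v0.2.0": 35, "v0.2.1": 62}
search_seconds = {"v0.2.0": 45, "v0.2.1": 75}


def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


index_increase = percent_increase(index_minutes["v0.2.0"], index_minutes["v0.2.1"])
search_increase = percent_increase(search_seconds["v0.2.0"], search_seconds["v0.2.1"])
print(f"index: +{index_increase:.0f}%, global search: +{search_increase:.0f}%")
```

This works out to about 77% for indexing and about 67% for global search, broadly in line with the quoted range.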

Vaccy-Zhu commented 3 months ago

import re

def extract_json(text: str) -> str:
    """
    Extract JSON content from a string where the JSON is wrapped
    in ``` or ```json fence markers.
    Returns the input unchanged if no fenced block is found.
    """
    # Match the first fenced block; an optional "json" language tag is skipped
    # so that it does not end up at the start of the extracted text
    pattern = r"```(?:json)?\s*(.*?)```"

    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return text

    # Return the fenced content, stripped of leading and trailing whitespace
    return match.group(1).strip()

Use the function above to process your LLM output.
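
A usage sketch for this kind of cleanup: strip a fence if one is present, then hand the result to `json.loads`. The helper name `parse_llm_json` and the sample replies are hypothetical. The traceback in the issue shows the failing `json.loads` call inside `try_parse_json_object` in `graphrag/llm/openai/utils.py`, so that may be a natural place to apply such cleanup in the affected versions. That is an inference from the log, not a confirmed fix.

```python
import json
import re
from typing import Any, Optional

# Three backticks, built programmatically so this example stays easy to embed.
FENCE = "`" * 3

# Matches the first fenced block, skipping an optional "json" language tag.
FENCE_PATTERN = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(reply: str) -> Optional[Any]:
    """Hypothetical helper: strip a Markdown fence if present, then parse."""
    match = FENCE_PATTERN.search(reply)
    candidate = match.group(1).strip() if match else reply.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # The caller decides whether to retry the LLM request or give up.
        return None


fenced = "Here is the output:" + FENCE + 'json\n{"title": "Baidu Community"}\n' + FENCE + "Let me know!"
bare = '{"rating": 6.0}'
broken = "Sorry, I cannot produce a report."

parsed_fenced = parse_llm_json(fenced)
parsed_bare = parse_llm_json(bare)
parsed_broken = parse_llm_json(broken)
```

Returning `None` instead of raising keeps the retry decision with the caller, which matches the "error parsing llm json, retrying" behaviour visible in the log.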

natoverse commented 3 months ago

We have resolved several issues related to text encoding and JSON parsing that are rolled up into version 0.2.2. Please try again with that version and re-open if this is still an issue.

zidanereal5 commented 2 months ago

> extract_json

Which file should be modified to add this function?