microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: I'm getting an error with the create_final_community_reports step on Chinese text #603

Closed crazyyanchao closed 3 weeks ago

crazyyanchao commented 1 month ago

Describe the bug

python -m graphrag.index --root ./microsoft_graphrag

ERROR:
❌ create_final_community_reports
None
⠋ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ----- 100% 0:00:… 0:00:…
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
└── create_final_community_reports
❌ Errors occurred during the pipeline run, see logs for more details.

Steps to reproduce

TEST DATA - news.txt

来源: 新浪财经
新闻发布时间: 2024-07-16 23:59:42
详情: 亚特兰大联储GDPNow模型预计美国第二季度GDP增速为2.5%,此前预计为2.0%。
短讯类型: 数据,观点

来源: 华尔街见闻
新闻发布时间: 2024-07-16 23:59:03
标题: 王毅同匈牙利外长西雅尔多通电话
详情: 7月16日,中共中央政治局委员、外交部长王毅应约同匈牙利外长西雅尔多通电话。西雅尔多介绍了对当前局势特别是乌克兰危机的看法以及匈方近期所作努力,表示中国是支持促进和平的重要力量,匈方愿同中方携手,防止冲突扩大升级,积累政治解决的条件。王毅说,欧尔班总理日前专程来华,彰显了两国领导人的互信和友谊,也体现了中匈新时代全天候全面战略伙伴关系的高水平。匈牙利为寻求和平奔走斡旋,发挥了建设性作用。中方认为,当前最紧迫的事项、也是最现实的目标就是推动乌克兰局势尽快降温。各方尽快就“战场不外溢、战事不升级、各方不拱火”达成共识,进而为实现停火、恢复和谈创造条件。(新华社)
短讯类型: 要闻,A股,外汇
是否重要: 0.0

来源: 凤凰
新闻发布时间: 2024-07-16 23:58:54
详情: 俄罗斯计划针对欧佩克+协议实施补偿性石油减产。

来源: 新浪财经
新闻发布时间: 2024-07-16 23:58:43
详情: 俄罗斯计划针对OPEC+协议实施补偿性石油减产。
关联股票: 石油,石油,石油,石油,石油,石油
短讯类型: 国际

来源: 凤凰
新闻发布时间: 2024-07-16 23:58:37
详情: 【美国伊利诺伊州一大坝发生溃坝 当地居民开始疏散】当地时间7月16日上午,美国伊利诺伊州华盛顿县应急管理局宣布,受强降雨影响,该州纳什维尔大坝发生溃坝。当地官员已发出疏散警告,当地居民已开始疏散。

来源: 华尔街见闻
新闻发布时间: 2024-07-16 23:58:23
标题: 报道:俄罗斯计划额外减产石油
详情: 据知情人士透露, 俄罗斯计划在2024-25年的暖和季节针对OPEC+协议实施补偿性石油减产。由于技术性缘故,俄罗斯今明两年夏季和秋季早期的石油减产幅度极可能会超过OPEC+石油减产协议规定的水平。在寒冷季节,产油国俄罗斯国内的消费需求也会增长。(彭博)
短讯类型: 要闻,外汇,黄金,石油,美股,港股
是否重要: 0.0

来源: 新浪财经
新闻发布时间: 2024-07-16 23:57:45
详情: 【美国伊利诺伊州一大坝发生溃坝 当地居民开始疏散】当地时间7月16日上午,美国伊利诺伊州华盛顿县应急管理局宣布,受强降雨影响,该州纳什维尔大坝发生溃坝。当地官员已发出疏散警告,当地居民已开始疏散。(央视新闻)
短讯类型: 国际

来源: 同花顺
新闻发布时间: 2024-07-16 23:57:14
标题: 王毅同匈牙利外长西雅尔多通电话
详情: 7月16日,中共中央政治局委员、外交部长王毅应约同匈牙利外长西雅尔多通电话。 
西雅尔多介绍了对当前局势特别是乌克兰危机的看法以及匈方近期所作努力,表示中国是支持促进和平的重要力量,匈方愿同中方携手,防止冲突扩大升级,积累政治解决的条件。 
王毅说,欧尔班总理日前专程来华,习近平主席同他就事关和平的重要议题进行战略沟通,彰显了两国领导人的互信和友谊,也体现了中匈新时代全天候全面战略伙伴关系的高水平。匈牙利为寻求和平奔走斡旋,发挥了建设性作用。中方认为,当前最紧迫的事项、也是最现实的目标就是推动乌克兰局势尽快降温。各方尽快就“战场不外溢、战事不升级、各方不拱火”达成共识,进而为实现停火、恢复和谈创造条件。中方愿同匈方一道,汇聚更多支持和平的力量,发出更多理性的声音,推动局势朝着政治解决的方向发展。(新华社)
短讯类型: 7*24小时全球直播
是否重要: 0.0

来源: 华尔街见闻
新闻发布时间: 2024-07-16 23:56:08
标题: 科大讯飞:今年1-5月AI学习机销量增长超过100%
详情: 科大讯飞披露投资者关系活动记录表显示,科大讯飞AI学习机今年1-5月份销量增长超过100%,用户净推荐值持续保持行业第一。科大讯飞2023年率先将大模型技术落地到学习机上,推出了九大AI 1对1辅导功能。此外,讯飞星火APP自去年9月正式全民开放后,目前在安卓端统计到已经累计下载了1.31亿次。在医疗领域,讯飞晓医APP可以诊断1600种常见疾病、2000多种症状,能识别有2800多种常见药品,也可以理解26万个药品相互作用。近日,科大讯飞与交通银行、人保集团、招商银行、国元证券等100多家金融机构展开了更为紧密的合作交流。
短讯类型: 要闻,A股,
是否重要: 0.0

来源: 新浪财经
新闻发布时间: 2024-07-16 23:55:46
详情: 【王毅同匈牙利外长西雅尔多通电话】7月16日,中共中央政治局委员、外交部长王毅应约同匈牙利外长西雅尔多通电话。 
西雅尔多介绍了对当前局势特别是乌克兰危机的看法以及匈方近期所作努力,表示中国是支持促进和平的重要力量,匈方愿同中方携手,防止冲突扩大升级,积累政治解决的条件。 
王毅说,欧尔班总理日前专程来华,习近平主席同他就事关和平的重要议题进行战略沟通,彰显了两国领导人的互信和友谊,也体现了中匈新时代全天候全面战略伙伴关系的高水平。匈牙利为寻求和平奔走斡旋,发挥了建设性作用。中方认为,当前最紧迫的事项、也是最现实的目标就是推动乌克兰局势尽快降温。各方尽快就“战场不外溢、战事不升级、各方不拱火”达成共识,进而为实现停火、恢复和谈创造条件。中方愿同匈方一道,汇聚更多支持和平的力量,发出更多理性的声音,推动局势朝着政治解决的方向发展。(新华社)
短讯类型: 国际

Expected Behavior

No response

GraphRAG Config Used

No response

Logs and screenshots

D:\workspace\datalab\jsrag\venv\Scripts\python.exe D:\workspace\datalab\jsrag\tests\microsoft_graphrag\test_index.py
🚀 Reading settings from settings.yaml
D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_base_text_units
id ... n_tokens
0 663affd4de3a71cf9b612eae20d761d3 ... 300
1 52bd1651a5d8d56d81c1801482965a6d ... 300
2 4d0bb660afb52d68caab05f6bf0b4350 ... 300
3 7f6fd5708e48481bb673873f9ed15c30 ... 300
4 8204398b499f96829990a722591a9b83 ... 300
5 71f0b35fc170eb480295254536496a1d ... 300
6 04eeac4ad722a860496c0937e3eb1856 ... 300
7 a0f38ad84274b55756cc22177409ff46 ... 300
8 4f58b1869ff3695c8bd4e994ef8c84de ... 300
9 f170ecad55e15cfe417d0302b691ca4b ... 300
10 60d03a6c901c38de26ffd7df96aa5d18 ... 300
11 a1215748205916f7b5e0adccc9c22795 ... 300
12 b1fd07ccd1a54a8ed9b309f2d01607a9 ... 285
13 ca60e805b3f302a56091e5f3c8db2ab8 ... 85

[14 rows x 5 columns]
🚀 create_base_extracted_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_base_entity_graph
level clustered_graph
0 0 <graphml xmlns="http://graphml.graphdrawing.or...
D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_final_entities
id ...
description_embedding 0 b45241d70f0e43fca764df95b2b81f77 ... [-0.0014526282, -0.022810562, -0.012244933, 0.... 1 4119fd06010c494caa07f439b333f4c5 ... [-0.031926226, -0.02635819, 0.05546864, 0.0466... 2 d3835bf3dda84ead99deadbeac5d0d7d ... [-0.012088798, -0.01461496, 0.028455619, 0.096... 3 077d2820ae1845bcbb1803379a3d1eae ... [-0.016000712, -0.045031797, -0.019178852, 0.0... 4 3671ea0dd4e84c1a9b02c5ab2c8f4bac ... [-0.047121223, -0.003207534, -0.042235475, 0.0... 5 19a7f254a5d64566ab5cc15472df02de ... [-0.09580274, -0.018634425, -0.015189313, 0.03... 6 e7ffaee9d31d4d3c96e04f911d0a8f9e ... [-0.068246044, -0.032368124, -0.013941692, 0.0... 7 f7e11b0e297a44a896dc67928368f600 ... [-0.08585356, 0.008408449, 0.011477692, 0.0618... 8 1fd3fa8bb5a2408790042ab9573779ee ... [0.028590905, -0.035342693, 0.027007151, 0.101... 9 27f9fbe6ad8c4a8b9acee0d3596ed57c ... [-0.025687896, -0.03244348, 0.0051344517, 0.07... 10 e1fd0e904a53409aada44442f23a51cb ... [-0.050529823, -0.0054584057, -0.025121942, 0.... 11 de988724cfdf45cebfba3b13c43ceede ... [-0.07442019, 0.016910737, -0.014245165, 0.060... 12 96aad7cb4b7d40e9b7e13b94a67af206 ... [-0.04028788, -0.0081702685, -0.033586647, 0.0... 13 c9632a35146940c2a86167c7726d35e9 ... [-0.04757148, -0.01928787, -0.011983054, 0.014... 14 9646481f66ce4fd2b08c2eddda42fc82 ... [-0.06481394, -0.003877871, -0.00751088, 0.048... 15 d91a266f766b4737a06b0fda588ba40b ... [-0.06696816, 0.008392776, -0.0054419786, 0.03... 16 bc0e3f075a4c4ebbb7c7b152b65a5625 ... [-0.011605713, -0.030790666, -0.0008902109, 0.... 17 254770028d7a4fa9877da4ba0ad5ad21 ... [-0.045195457, -0.02490488, -0.024781283, -0.0... 18 4a67211867e5464ba45126315a122a8a ... [-0.05511852, 0.0013663026, 0.043397218, 0.002... 19 04dbbb2283b845baaeac0eaf0c34c9da ... [-0.058006745, 0.009991474, 0.04448104, 0.0205... 20 1943f245ee4243bdbfbd2fd619ae824a ... [-0.03958215, -7.2115225e-05, 0.08936498, 0.02... 21 273daeec8cad41e6b3e450447db58ee7 ... [-0.032186434, 0.015104188, 0.03962565, 0.0865... 
22 e69dc259edb944ea9ea41264b9fcfe59 ... [-0.04429078, 0.018155806, 0.03003922, 0.09310... 23 e2f5735c7d714423a2c4f61ca2644626 ... [-0.055190273, 0.017616868, 0.016507143, 0.086... 24 deece7e64b2a4628850d4bb6e394a9c3 ... [-0.031237053, 0.037238844, 0.02804798, 0.0785... 25 e657b5121ff8456b9a610cfaead8e0cb ... [0.027616562, 0.035711832, 0.03694708, 0.02657... 26 bf4e255cdac94ccc83a56435a5e4b075 ... [0.044378877, 0.03286549, 0.053403515, 0.07247... 27 3b040bcc19f14e04880ae52881a89c1c ... [-0.03962782, 0.010889069, 0.045615856, -0.003... 28 3d6b216c14354332b1bf1927ba168986 ... [-0.018539483, 0.015377053, 0.006324861, 0.004... 29 1c109cfdc370463eb6d537e5b7b382fb ... [-0.065681376, -0.004882022, 0.03561782, 0.017... 30 3d0dcbc8971b415ea18065edc4d8c8ef ... [0.029657133, -0.023170393, -0.0046286224, 0.0... 31 68105770b523412388424d984e711917 ... [-0.030532323, 0.037991133, -0.0011697958, 0.0... 32 85c79fd84f5e4f918471c386852204c5 ... [-0.04545134, 0.03209481, 0.028792372, 0.08772... 33 eae4259b19a741ab9f9f6af18c4a0470 ... [-0.031185862, 0.057401102, 0.026549801, 0.042... 34 3138f39f2bcd43a69e0697cd3b05bc4d ... [0.05237491, 0.0035916413, -0.01448442, 0.0371... 35 dde131ab575d44dbb55289a6972be18f ... [0.029156113, 0.05686912, 0.027205246, 0.05200... 36 de9e343f2e334d88a8ac7f8813a915e5 ... [-0.056767624, 0.006740077, 0.035738584, 0.068... 37 e2bf260115514fb3b252fd879fb3e7be ... [-0.024709905, -0.07275798, -0.033372466, 0.01... 38 b462b94ce47a4b8c8fffa33f7242acec ... [-0.05806274, -0.036697976, 0.010909473, -0.01... 39 17ed1d92075643579a712cc6c29e8ddb ... [-0.010521335, -0.017530346, -0.0148518095, 0.... 40 3ce7c210a21b4deebad7cc9308148d86 ... [-0.0099690715, -0.021275608, 0.015752371, 0.0... 41 d64ed762ea924caa95c8d06f072a9a96 ... [-0.006371636, -0.031781916, -0.043901198, 0.0... 42 adf4ee3fbe9b4d0381044838c4f889c8 ... [0.03008028, -0.054739267, 0.03621213, 0.04168... 43 32ee140946e5461f9275db664dc541a5 ... [0.0067304475, -0.036523268, -0.04421857, -0.0... 
44 c160b9cb27d6408ba6ab20214a2f3f81 ... [-0.018944053, -0.02489056, 0.0071473666, 0.02... 45 23527cd679ff4d5a988d52e7cd056078 ... [0.027823413, -0.06449872, 0.010141867, 0.0303... 46 f1c6eed066f24cbdb376b910fce29ed4 ... [0.028539598, -0.06675614, 0.042784836, 0.0615... 47 83a6cb03df6b41d8ad6ee5f6fef5f024 ... [0.0039935475, -0.035595946, 0.0060721235, 0.0... 48 147c038aef3e4422acbbc5f7938c4ab8 ... [0.0077742436, -0.036676265, -0.0007366069, 0.... 49 b7702b90c7f24190b864e8c6e64612a5 ... [-0.06719572, -0.02191515, -0.06679287, 0.0249... 50 de6fa24480894518ab3cbcb66f739266 ... [-0.03515449, -0.020157838, -0.03369378, 0.027... 51 6fae5ee1a831468aa585a1ea09095998 ... [-0.051609986, 0.010323131, -0.036757376, 0.02... 52 ef32c4b208d041cc856f6837915dc1b0 ... [0.008965356, -0.037247375, 0.03903756, 0.0624... 53 07b2425216bd4f0aa4e079827cb48ef5 ... [-0.032199547, -0.027251536, 0.040896047, 0.03...

[54 rows x 8 columns] D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead. return bound(*args, **kwds) D:\workspace\datalab\jsrag\venv\lib\site-packages\datashaper\engine\verbs\convert.py:72: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing errors and catch exceptions explicitly instead datetime_column = pd.to_datetime(column, errors="ignore") D:\workspace\datalab\jsrag\venv\lib\site-packages\datashaper\engine\verbs\convert.py:72: UserWarning: Could not infer format, so each element will be parsed individually, falling back to dateutil. To ensure parsing is consistent and as-expected, please specify a format. datetime_column = pd.to_datetime(column, errors="ignore") 🚀 create_final_nodes level title ... x y 0 0 "新浪财经" ... 0 0 1 0 "亚特兰大联储" ... 0 0 2 0 "美国" ... 0 0 3 0 "华尔街见闻" ... 0 0 4 0 "王毅" ... 0 0 5 0 "匈牙利" ... 0 0 6 0 "西雅尔多" ... 0 0 7 0 "乌克兰危机" ... 0 0 8 0 "中国" ... 0 0 9 0 "CHINA" ... 0 0 10 0 "HUNGARY" ... 0 0 11 0 "UKRAINE" ... 0 0 12 0 "WANG YI" ... 0 0 13 0 "ORBAN" ... 0 0 14 0 "UKRAINE CRISIS" ... 0 0 15 0 "乌克兰" ... 0 0 16 0 "新华社" ... 0 0 17 0 "凤凰" ... 0 0 18 0 "俄罗斯" ... 0 0 19 0 "欧佩克+协议" ... 0 0 20 0 "OPEC+" ... 0 0 21 0 "美国伊利诺伊州" ... 0 0 22 0 "华盛顿县应急管理局" ... 0
0 23 0 "纳什维尔大坝" ... 0 0 24 0 "溃坝事件" ... 0 0 25 0 "当地官员" ... 0 0 26 0 "当地居民" ... 0 0 27 0 "OPEC+协议" ... 0 0 28 0 "彭博" ... 0 0 29 0 "RUSSIA" ... 0 0 30 0 "SINA FINANCE" ... 0 0 31 0 "ILLINOIS" ... 0 0 32 0 "WASHINGTON COUNTY EMERGENCY MANAGEMENT BUREAU" ... 0 0 33 0 "NASHVILLE DAM" ... 0 0 34 0 "CCTV NEWS" ... 0 0 35 0 "DAM COLLAPSE" ... 0 0 36 0 "美国伊利诺伊州华盛顿县应急管理局"
... 0 0 37 0 "习近平" ... 0 0 38 0 "欧尔班" ... 0 0 39 0 "XINHUA NEWS AGENCY" ... 0 0 40 0 "IFLYTEK" ... 0 0 41 0 "WALL STREET SEEN" ... 0 0 42 0 "科大讯飞" ... 0 0 43 0 "讯飞星火APP" ... 0 0 44 0 "讯飞晓医APP" ... 0 0 45 0 "交通银行" ... 0 0 46 0 "人保集团" ... 0 0 47 0 "招商银行" ... 0 0 48 0 "国元证券" ... 0 0 49 0 "SZIJJARTO" ... 0 0 50 0 "CENTRAL POLITICAL BUREAU OF THE COMMUNIST PAR... ... 0 0 51 0 "PHONE CALL" ... 0 0 52 0 "中方" ... 0 0 53 0 "匈方" ... 0 0

[54 rows x 14 columns]
D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_final_communities
id ... text_unit_ids
0 1 ... [663affd4de3a71cf9b612eae20d761d3,8204398b499f...
1 2 ... [663affd4de3a71cf9b612eae20d761d3,a0f38ad84274...
2 0 ... [4d0bb660afb52d68caab05f6bf0b4350,b1fd07ccd1a5...

[3 rows x 6 columns] 🚀 join_text_units_to_entity_ids text_unit_ids ... id 0 4d0bb660afb52d68caab05f6bf0b4350 ... 4d0bb660afb52d68caab05f6bf0b4350 1 663affd4de3a71cf9b612eae20d761d3 ... 663affd4de3a71cf9b612eae20d761d3 2 8204398b499f96829990a722591a9b83 ... 8204398b499f96829990a722591a9b83 3 04eeac4ad722a860496c0937e3eb1856 ... 04eeac4ad722a860496c0937e3eb1856 4 60d03a6c901c38de26ffd7df96aa5d18 ... 60d03a6c901c38de26ffd7df96aa5d18 5 a0f38ad84274b55756cc22177409ff46 ... a0f38ad84274b55756cc22177409ff46 6 b1fd07ccd1a54a8ed9b309f2d01607a9 ... b1fd07ccd1a54a8ed9b309f2d01607a9 7 4f58b1869ff3695c8bd4e994ef8c84de ... 4f58b1869ff3695c8bd4e994ef8c84de 8 52bd1651a5d8d56d81c1801482965a6d ... 52bd1651a5d8d56d81c1801482965a6d 9 a1215748205916f7b5e0adccc9c22795 ... a1215748205916f7b5e0adccc9c22795 10 ca60e805b3f302a56091e5f3c8db2ab8 ... ca60e805b3f302a56091e5f3c8db2ab8 11 7f6fd5708e48481bb673873f9ed15c30 ... 7f6fd5708e48481bb673873f9ed15c30 12 71f0b35fc170eb480295254536496a1d ... 71f0b35fc170eb480295254536496a1d 13 f170ecad55e15cfe417d0302b691ca4b ... f170ecad55e15cfe417d0302b691ca4b

[14 rows x 3 columns]
D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
D:\workspace\datalab\jsrag\venv\lib\site-packages\numpy\core\fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
D:\workspace\datalab\jsrag\venv\lib\site-packages\datashaper\engine\verbs\convert.py:65: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing errors and catch exceptions explicitly instead
  column_numeric = cast(pd.Series, pd.to_numeric(column, errors="ignore"))
🚀 create_final_relationships
source ... rank
0 "新浪财经" ... 4
1 "新浪财经" ... 7
2 "亚特兰大联储" ... 3
3 "华尔街见闻" ... 4
4 "王毅" ... 5
5 "王毅" ... 7
6 "匈牙利" ... 4
7 "匈牙利" ... 7
8 "匈牙利" ... 6
9 "西雅尔多" ... 3
10 "CHINA" ... 6
11 "CHINA" ... 6
12 "CHINA" ... 7
13 "HUNGARY" ... 6
14 "HUNGARY" ... 7
15 "UKRAINE" ... 7
16 "WANG YI" ... 6
17 "WANG YI" ... 6
18 "WANG YI" ... 8
19 "WANG YI" ... 5
20 "ORBAN" ... 4
21 "乌克兰" ... 6
22 "乌克兰" ... 7
23 "新华社" ... 5
24 "新华社" ... 5
25 "凤凰" ... 6
26 "俄罗斯" ... 6
27 "俄罗斯" ... 7
28 "俄罗斯" ... 6
29 "OPEC+" ... 3
30 "美国伊利诺伊州" ... 3
31 "华盛顿县应急管理局" ... 6
32 "纳什维尔大坝" ... 5
33 "纳什维尔大坝" ... 6
34 "纳什维尔大坝" ... 5
35 "当地官员" ... 3
36 "ILLINOIS" ... 4
37 "ILLINOIS" ... 4
38 "WASHINGTON COUNTY EMERGENCY MANAGEMENT BUREAU" ... 4
39 "NASHVILLE DAM" ... 4
40 "习近平" ... 5
41 "IFLYTEK" ... 2
42 "科大讯飞" ... 7
43 "科大讯飞" ... 7
44 "科大讯飞" ... 7
45 "科大讯飞" ... 7
46 "科大讯飞" ... 7
47 "科大讯飞" ... 7
48 "中方" ... 4

[49 rows x 10 columns] 🚀 join_text_units_to_relationship_ids id
relationship_ids 0 663affd4de3a71cf9b612eae20d761d3 [2670deebfa3f4d69bb82c28ab250a209, b785a902506... 1 4d0bb660afb52d68caab05f6bf0b4350 [404309e89a5241d6bff42c05a45df206, ed6d2eee9d7... 2 04eeac4ad722a860496c0937e3eb1856 [d54956b79dd147f894b67a8b97dcbef0, 1745a2485a9... 3 60d03a6c901c38de26ffd7df96aa5d18 [d54956b79dd147f894b67a8b97dcbef0, 3c063eea52e... 4 a0f38ad84274b55756cc22177409ff46 [958beecdb5bb4060948415ffd75d2b03, b999ed77e19... 5 b1fd07ccd1a54a8ed9b309f2d01607a9 [48c0c4d72da74ff5bb926fa0c856d1a7, 4f3c97517f7... 6 4f58b1869ff3695c8bd4e994ef8c84de [32e6ccab20d94029811127dbbe424c64, 94a964c6992... 7 52bd1651a5d8d56d81c1801482965a6d [32e6ccab20d94029811127dbbe424c64, 94a964c6992... 8 a1215748205916f7b5e0adccc9c22795 [1eb829d0ace042089f0746f78729696c, 26f88ab3e2e... 9 ca60e805b3f302a56091e5f3c8db2ab8 [56d0e5ebe79e4814bd1463cf6ca21394, 7c49f2710e8... 10 7f6fd5708e48481bb673873f9ed15c30 [6b02373137fd438ba96af28f735cdbdb, d2b629c0396... 11 8204398b499f96829990a722591a9b83 [36a4fcd8efc144e6b8af9a1c7ab8b2ce, e22d1d1cd8d... 12 71f0b35fc170eb480295254536496a1d [fbeef791d19b413a9c93c6608286ab63, 89c08e79329... 13 f170ecad55e15cfe417d0302b691ca4b [bb9e01bc171d4326a29afda59ece8d17, 3c063eea52e... ❌ create_final_community_reports None ⠋ GraphRAG Indexer ├── Loading Input (text) - 1 files loaded (0 filtered) ----- 100% 0:00:… 0:00:… ├── create_base_text_units ├── create_base_extracted_entities ├── create_summarized_entities ├── create_base_entity_graph ├── create_final_entities ├── create_final_nodes ├── create_final_communities ├── join_text_units_to_entity_ids ├── create_final_relationships ├── join_text_units_to_relationship_ids └── create_final_community_reports❌ Errors occurred during the pipeline run, see logs for more details.

Process finished with exit code 1

Additional Information

crazyyanchao commented 1 month ago

In addition, Chinese text in the cache folder is not displayed properly; it is stored as \u escape sequences:

{"result": "\"\\u8baf\\u98de\\u6653\\u533bAPP\" is a medical application developed by \\u79d1\\u5927\\u8baf\\u98de. This application is capable of diagnosing 1600 common diseases and symptoms, recognizing over 2800 common medications, and understanding 260,000 drug interactions. Additionally, it has the ability to comprehend a vast number of medical terms, making it a comprehensive tool in the medical field.", "input": "\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we the have full context.\n\n#######\n-Data-\nEntities: \"\\\"\\u8baf\\u98de\\u6653\\u533bAPP\\\"\"\nDescription List: [\"\\\"\\u8baf\\u98de\\u6653\\u533bAPP is a medical application by \\u79d1\\u5927\\u8baf\\u98de that can diagnose numerous diseases and symptoms, recognize a wide range of medications, and understand a vast number of medical terms.\\\"\", \"\\\"\\u8baf\\u98de\\u6653\\u533bAPP is an application in the medical field capable of diagnosing 1600 common diseases, recognizing over 2800 common drugs, and understanding 260,000 drug interactions.\\\"\"]\n#######\nOutput:\n", "parameters": {"model": "gpt-4o", "temperature": 0.0, "frequency_penalty": 0.0, "presence_penalty": 0.0, "top_p": 1.0, "max_tokens": 500, "n": null}}
Amitabh-Priyadarshi-Bayer commented 1 month ago

I am getting the same error. Take a look at my post, where I drilled down to some extent and found that there is a JSON decoder error caused by the {{ in the system message for community reports (prompts/community_reports.txt). I changed it to a single { and then found that community_reports.txt shows up as null in the reports, even though it is present in system.yaml.

more info https://github.com/microsoft/graphrag/discussions/573

Can you share your indexing-engine.log from output/reports?

albaNnaksqr commented 1 month ago

I think it is because .format() is used to fill the information into the prompt. When the text already contains literal brace characters ({}), they have to be written as double braces ({{ }}) instead. https://docs.python.org/3/library/string.html#formatstrings

However, I also hit the same error, in which the model's output contains double brace characters. These should never appear in the filled prompt given to the model...
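To illustrate the .format() behavior described here, a minimal sketch follows. The template text and the `input_text` field name are made up for illustration and are not taken from the GraphRAG prompts. It also shows that a template which reaches the model without passing through .format() would still contain the doubled braces; whether that is what happens in this issue is not confirmed by the thread.

```python
# Hypothetical template: literal JSON braces are doubled, {input_text} is a real field.
template = 'Reply with JSON such as {{"title": "<title>"}}.\nText: {input_text}'

# After .format(), each doubled brace collapses to a single literal brace.
filled = template.format(input_text="example text")

# A template that is used without .format() keeps its doubled braces as-is.
unformatted = template

print(filled)
print("{{" in unformatted, "{{" in filled)
```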

crazyyanchao commented 1 month ago

@Amitabh-Priyadarshi-Bayer I added an extract_json_dict function to the 'graphrag/llm/openai/utils.py' file to work around the dict error.

import json
import logging
import re

log = logging.getLogger(__name__)

def try_parse_json_object(input: str) -> dict:
    """Generate JSON-string output using best-attempt prompting & parsing techniques."""
    try:
        # result = json.loads(input)
        result = extract_json_dict(input)
    except json.JSONDecodeError:
        # Not reached in practice: extract_json_dict catches decode errors itself.
        log.exception("error loading json, json=%s", input)
        raise
    else:
        # extract_json_dict returns None on failure, which surfaces here as a TypeError.
        if not isinstance(result, dict):
            raise TypeError
        return result

def extract_json_dict(text: str):
    """Parse dict from text."""
    # Matches the first {...} span that contains no nested braces.
    pattern = r'\{[^{}]*\}'
    match = re.search(pattern, text)
    if match:
        json_str = match.group()
        try:
            json_dict = json.loads(json_str)
            return json_dict
        except json.JSONDecodeError:
            return None
    else:
        return None

After that, I got an error from graphrag.index.graph.extractors.community_reports.community_reports_extractor.
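One limitation of the pattern above is that `\{[^{}]*\}` cannot match an object that contains nested braces, so for nested output it picks up an inner object instead of the whole one. As a rough sketch of an alternative (not what GraphRAG does; `extract_first_json_object` is a hypothetical helper), the standard library's `json.JSONDecoder.raw_decode` can parse the first complete object starting at a given position:

```python
import json
import re
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning exactly at index `start`.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace: try the next opening brace.
        start = text.find("{", start + 1)
    return None


reply = 'Report: {"title": "x", "findings": [{"summary": "a"}]} end'
nested_result = extract_first_json_object(reply)

# For comparison, the regex only finds the innermost brace-free object.
regex_match = re.search(r"\{[^{}]*\}", reply).group()
```

Here `nested_result` holds the whole report object, while `regex_match` is only the inner findings entry.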

IdaWoods commented 1 month ago

I ran into the same problem yesterday, but today it miraculously succeeded when I ran it again without making any changes.

crazyyanchao commented 1 month ago

I've fixed the issue, which was mostly caused by unstable parsing functions. I modified the 'try_parse_json_object' function in graphrag/llm/openai/utils.py as follows:

import json
import logging

log = logging.getLogger(__name__)

def try_parse_json_object(input: str) -> dict:
    """Generate JSON-string output using best-attempt prompting & parsing techniques."""
    try:
        # Normalize the raw model output before handing it to the JSON parser.
        clean_json = clean_up_json(input)
        result = json.loads(clean_json)
    except json.JSONDecodeError:
        log.exception("error loading json, json=%s", input)
        raise
    else:
        if not isinstance(result, dict):
            raise TypeError
        return result

def clean_up_json(json_str: str) -> str:
    """Clean up json string."""
    json_str = (
        json_str.replace("\\n", "")
        .replace("\n", "")
        .replace("\r", "")
        .replace('"[{', "[{")
        .replace('}]"', "}]")
        # Note: this drops every backslash, including those of escaped quotes.
        .replace("\\", "")
        # Refer: graphrag\llm\openai\_json.py,graphrag\index\utils\json.py
        .replace("{{", "{")
        .replace("}}", "}")
        .strip()
    )

    # Remove JSON Markdown Frame
    if json_str.startswith("```json"):
        json_str = json_str[len("```json"):]
    if json_str.endswith("```"):
        json_str = json_str[: len(json_str) - len("```")]
    return json_str
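
A small standalone sketch of the Markdown-frame removal step on its own may help when testing this kind of cleanup. It is only an illustration under assumptions: `strip_markdown_fence` is a hypothetical helper, not part of GraphRAG, and the sample model output is invented. It also shows one caveat of removing every backslash: a reply with escaped quotes parses fine once the frame is stripped, but no longer parses after the backslashes are dropped.

```python
import json

FENCE = "`" * 3  # three backticks, built indirectly to keep the example readable


def strip_markdown_fence(text: str) -> str:
    """Remove a surrounding Markdown code frame, if present."""
    text = text.strip()
    if text.startswith(FENCE + "json"):
        text = text[len(FENCE + "json"):]
    elif text.startswith(FENCE):
        text = text[len(FENCE):]
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return text.strip()


# Invented model reply: a framed JSON object whose value contains escaped quotes.
model_output = FENCE + 'json\n{"title": "He said \\"hi\\""}\n' + FENCE
cleaned = strip_markdown_fence(model_output)
parsed = json.loads(cleaned)

# Dropping all backslashes turns the escaped quotes into bare quotes.
try:
    json.loads(cleaned.replace("\\", ""))
    survives_backslash_removal = True
except json.JSONDecodeError:
    survives_backslash_removal = False
```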
Amitabh-Priyadarshi-Bayer commented 1 month ago

@crazyyanchao
@AlonsoGuevara

There is an easier way to do this: change the system message for community_report. Change every {{ in community_report.txt to {, so that GPT generates the JSON in the correct format, rather than changing the codebase.

However, the problem is that GraphRAG is not reading the system message file defined for community reports in settings.yaml. In your log file, too, the prompt value in the community report section is null; it should be the system message filename specified in settings.yaml.

crazyyanchao commented 1 month ago

Thank you for your reply; I now understand the issue in more depth. I would also like to add that the current parsing function is indeed unstable, and I suggest following LangChain's practice of letting users customize the parser.
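
As a rough idea of what a user-customizable parser could look like, here is a minimal sketch. Every name in it (`JsonParser`, `default_parser`, `lenient_parser`, `parse_llm_output`) is hypothetical; it mirrors neither the GraphRAG nor the LangChain API and only shows the general shape of passing a parser in as a parameter.

```python
import json
from typing import Callable

# A parser is any callable that turns raw model text into a dict.
JsonParser = Callable[[str], dict]


def default_parser(text: str) -> dict:
    """Strict parser: the whole text must be a JSON object."""
    result = json.loads(text)
    if not isinstance(result, dict):
        raise TypeError("expected a JSON object")
    return result


def lenient_parser(text: str) -> dict:
    """User-supplied variant: parse only the span from the first { to the last }."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return default_parser(text[start : end + 1])


def parse_llm_output(text: str, parser: JsonParser = default_parser) -> dict:
    """Delegate parsing to whichever parser the caller provides."""
    return parser(text)


strict_result = parse_llm_output('{"a": 1}')
lenient_result = parse_llm_output('Report: {"a": 1} done', parser=lenient_parser)
```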

minxiansheng commented 1 month ago

I found that with the same article, if it contains too many words, an error is reported in create_final_entities. If I delete some words, the run succeeds. Which parameter does this word limit relate to: the embedding model or settings.yaml?

minglong-huang commented 1 month ago

Wow, I could cry. I spent ages trying to solve this and couldn't get it working. Thank you so much, and best wishes to you!

cenlibin commented 1 month ago

In addition, the Chinese in the cache folder are not well displayed.

* graphrag\cache\summarize_descriptions\summarize-chat-v2-0a51e37418831e8ba9bc4fc845b00f56
{"result": "\"\\u8baf\\u98de\\u6653\\u533bAPP\" is a medical application developed by \\u79d1\\u5927\\u8baf\\u98de. This application is capable of diagnosing 1600 common diseases and symptoms, recognizing over 2800 common medications, and understanding 260,000 drug interactions. Additionally, it has the ability to comprehend a vast number of medical terms, making it a comprehensive tool in the medical field.", "input": "\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we the have full context.\n\n#######\n-Data-\nEntities: \"\\\"\\u8baf\\u98de\\u6653\\u533bAPP\\\"\"\nDescription List: [\"\\\"\\u8baf\\u98de\\u6653\\u533bAPP is a medical application by \\u79d1\\u5927\\u8baf\\u98de that can diagnose numerous diseases and symptoms, recognize a wide range of medications, and understand a vast number of medical terms.\\\"\", \"\\\"\\u8baf\\u98de\\u6653\\u533bAPP is an application in the medical field capable of diagnosing 1600 common diseases, recognizing over 2800 common drugs, and understanding 260,000 drug interactions.\\\"\"]\n#######\nOutput:\n", "parameters": {"model": "gpt-4o", "temperature": 0.0, "frequency_penalty": 0.0, "presence_penalty": 0.0, "top_p": 1.0, "max_tokens": 500, "n": null}}

Changing the default value of the ensure_ascii parameter of the json library's dump and dumps methods to False seems to work around this problem for now.
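
For reference, the effect of ensure_ascii can be seen with the standard json module alone. This is a small sketch using the entity name from the cache entry above; it passes the argument explicitly instead of patching the library default, and it does not touch GraphRAG's cache code.

```python
import json

record = {"result": "讯飞晓医APP"}

# Default behavior: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same data; only the on-disk text differs.
same_data = json.loads(escaped) == json.loads(readable) == record
```

When writing the readable form to a file, the file should be opened with an explicit encoding such as UTF-8.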

natoverse commented 3 weeks ago

We have resolved several issues related to text encoding and JSON parsing that are rolled up into version 0.2.2. Please try again with that version and re-open if this is still an issue.