gusye1234 / nano-graphrag

A simple, easy-to-hack GraphRAG implementation
MIT License
841 stars 87 forks source link

encounter many 'gbk' codec errors #50

Open cyberflying opened 1 week ago

cyberflying commented 1 week ago

Since the official GraphRAG requires UTF-8 encoding, I prepared some input files in UTF-8 format. When using them with nano-graphrag, I hardcoded encoding='utf-8', but still encountered many 'gbk' codec errors. Could there be a global config to set the encoding?

For example, in _storage.py: [image]

Thanks!
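As a minimal sketch of what "one place to set the encoding" could look like, the snippet below routes JSON reads and writes through two helpers that share a single constant. The names `ENCODING`, `write_json`, `load_json`, and `roundtrip_demo` are hypothetical and are not part of nano-graphrag. Separately, Python 3.7+ offers UTF-8 Mode (the `PYTHONUTF8=1` environment variable or the `-X utf8` flag), which makes `open()` default to UTF-8 on Windows without code changes and may serve as a global switch.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical module-level setting; nano-graphrag does not define this name.
ENCODING = "utf-8"


def write_json(obj, path):
    # ensure_ascii=False keeps non-ASCII text as real characters in the file,
    # so the explicit encoding decides the bytes written to disk.
    with open(path, "w", encoding=ENCODING) as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)


def load_json(path):
    # Read with the same explicit encoding instead of the locale default.
    with open(path, "r", encoding=ENCODING) as f:
        return json.load(f)


def roundtrip_demo():
    # Write a small document containing Hangul and read it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo.json"
        write_json({"entity": "삼성"}, path)
        return load_json(path), path.read_bytes()


if __name__ == "__main__":
    data, raw = roundtrip_demo()
    print(data)
```

Without an explicit `encoding=` argument, `open()` falls back to the locale's preferred encoding, which is typically GBK (cp936) on a Chinese-language Windows installation.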

cyberflying commented 1 week ago

Also, the output file "vdb_entities.json" is not UTF-8 encoded either: [image]

luckfu commented 1 week ago

Are you running this on Windows? I've hit this too; the write probably doesn't specify UTF-8.

cyberflying commented 1 week ago

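> Are you running this on Windows? I've hit this too; the write probably doesn't specify UTF-8.

Yes, on Windows. I changed the code to specify the encoding, but then the same error showed up in other places, so I'd rather ask the author to fix it in the source.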

gusye1234 commented 1 week ago

Fixed. Please pull the latest code and test it: Fixed save/write encoding problem of utf-8

cyberflying commented 1 week ago

> Fixed. Please pull the latest code and test it: Fixed save/write encoding problem of utf-8

Thanks for the reply! It still fails, though, at line 121 in _storage.py:

Exception has occurred: UnicodeEncodeError
'gbk' codec can't encode character '\uc0bc' in position 3: illegal multibyte sequence
  File "C:\demo\nano-graphrag\nano_graphrag\graphrag.py", line 312, in ainsert
    await self.chunk_entity_relation_graph.clustering(
  File "C:\demo\nano-graphrag\nano_graphrag\_storage.py", line 374, in clustering
    await self._clustering_algorithms[algorithm]()
  File "C:\demo\nano-graphrag\nano_graphrag\_storage.py", line 437, in _leiden_clustering
    from graspologic.partition import hierarchical_leiden
ModuleNotFoundError: No module named 'past'

During handling of the above exception, another exception occurred:

  File "C:\demo\nano-graphrag\nano_graphrag\_storage.py", line 121, in index_done_callback
    self._client.save()
  File "C:\demo\nano-graphrag\nano_graphrag\graphrag.py", line 339, in _insert_done
    await asyncio.gather(*tasks)
  File "C:\demo\nano-graphrag\nano_graphrag\graphrag.py", line 323, in ainsert
    await self._insert_done()
  File "C:\demo\nano-graphrag\nano_graphrag\graphrag.py", line 205, in insert
    return loop.run_until_complete(self.ainsert(string_or_strings))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\demo\nano-graphrag\test.py", line 12, in <module>
    graph_func.insert(f.read())
UnicodeEncodeError: 'gbk' codec can't encode character '\uc0bc' in position 3: illegal multibyte sequence

Also: the written file vdb_entities.json still shows garbled text when opened as UTF-8, but displays correctly when opened as gb2312.
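The UnicodeEncodeError above can be reproduced outside nano-graphrag. The character '\uc0bc' is a Hangul syllable, which the GBK codec has no mapping for, so any write that falls back to GBK will fail on it. The sketch below checks this directly; `can_encode` is a hypothetical helper name used only for illustration.

```python
import locale


def can_encode(text, codec):
    # Return True if every character in text has a mapping in the codec.
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    ch = "\uc0bc"  # the character named in the traceback
    # The encoding open() uses when no encoding= argument is given.
    print("preferred encoding:", locale.getpreferredencoding(False))
    print("gbk:", can_encode(ch, "gbk"))
    print("utf-8:", can_encode(ch, "utf-8"))
```

If the preferred encoding printed on the affected machine is a GBK variant such as cp936, that would also explain why vdb_entities.json reads correctly as gb2312 and not as UTF-8: the file was most likely written with the locale default.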

gusye1234 commented 1 week ago

Is this a new working dir?

cyberflying commented 1 week ago

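It's the original working_dir. I kept only the source .txt file and deleted all the intermediate files that had been generated. I also switched from 4o to 4o-mini; it raises errors if the intermediate files are not deleted.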

rangehow commented 1 week ago

Did you update the repo with pip install git+ ? By the way, you need to pip install future

cyberflying commented 1 week ago

Oh, I updated the repo but forgot to update the pip-installed nano :( I'll test again later. Thanks for the reminder!
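Both points raised here (the missing `past` module and a stale pip-installed copy shadowing the updated checkout) can be checked from Python before rerunning. This is a minimal sketch; `module_origin` is a hypothetical helper name, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def module_origin(name):
    # Return the file a top-level module would be imported from,
    # or None if it is not installed. Namespace packages also
    # report None because they have no single origin file.
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # 'past' is shipped by the 'future' package on PyPI.
    for name in ("past", "nano_graphrag"):
        print(name, "->", module_origin(name))
```

If `nano_graphrag` resolves to a path under site-packages instead of the local checkout, the interpreter is still using the old installed copy, and reinstalling from the updated source should pick up the encoding fix.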
