microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Serious bug] Text files with Chinese content are not supported #424

Open Pandas886 opened 1 week ago

Pandas886 commented 1 week ago

I tried to run a RAG test on Qian Zhongshu's "Fortress Besieged" and encountered the following errors.

Pipeline output:

❌ create_final_community_reports
None
⠋ GraphRAG Indexer 
├── Loading Input (text) - 1 files loaded (0 filtered) ----- 100% 0:00:… 0:00:…
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
└── create_final_community_reports
❌ Errors occurred during the pipeline run, see logs for more details.

The logs show the following:

10:54:04,391 graphrag.llm.openai.utils ERROR error loading json, json=```json
{
    "title": "�����̼�������������",
    "summary": "�������Է����̼���Ϊ���ģ��漰���м��������ʵ�塣��������Ϊ���峤�����Լ�ͥ���������Ӱ�죬������������ҽҩ���档������Ϊ�������ģ�������Ա�����еĹ����Ͳ�����ϵ�������ڵĹ�ϵ���ӣ��漰�����ڲ��Ľ��������þ����Լ����ⲿʵ��Ļ�����",
    "rating": 6.5,
    "rating_explanation": "��������Ӱ������Ϊ�е�ƫ�ϣ���Ҫ��Ϊ�����ڲ��Ľ����;��þ��߿��ܶ����������������Ӧ��",
    "findings": [
        {
            "summary": "�������ڼ����еĺ��ĵ�λ",
            "explanation": "��������Ϊ�����еij������Լ�ͥ��������Զ��Ӱ�졣�����������ӵ��������ж��صļ��⣬���Լ�ͥҽҩ������Ȥ�������Դ����ϱ���IJ��顣�����̵���Ϊ�;����ڼ����о����쵼���ã������ռǺ�����¼��ϸ��¼�������쵼���������Կ���[Data: Entities (282), Relationships (639, 1316, 1323, 1317, 1320, 1325, 1321, 1318, 1324, 1322)]��"
        },
        {
            "summary": "�����������еĽ�ɫ",
            "explanation": "���в����Ǽ����Ա�����ĵص㣬Ҳ�Ǵ�������׺ʹ�������ġ����轥�����й�������ʾ������ְҵ��ݺ������еĹ�ϵ�����л��漰�������Ա�ľ��þ��ߣ��緽�轥�ƻ�ȥ����֧���˵�����ʾ���IJ���������ͥ����״���й�[Data: Entities (122), Relationships (681, 1075, 276, 1073, 1079, 1071, 1063, 1076, 1074, 1077, 1078)]��"
        },
        {
            "summary": "�����ڲ��Ľ�����������ͳ",
            "explanation": "�����̶����ӵ������ж��صļ��⣬���Ϊ����ȡ��Ϊ���ǹ����������ù��������������塣����������ͳ��ӳ�˼���Խ��������ӺͶԴ�ͳ�Ļ������ء������̵���Ϊ�;����ڼ����о����쵼���ã������ռǺ�����¼��ϸ��¼�������쵼���������Կ���[Data: Entities (282), Relationships (1316, 1321, 1318, 1320)]��"
        },
        {
            "summary": "�����Ա���ⲿʵ��Ļ���",
            "explanation": "���轥�����еĹ�ϵ�������ڹ��������������ⲿʵ��Ļ�����������С����ż����������ֻ�����ʾ�˼����Ա�������е��罻��ְҵ���硣������Ϊ�������ģ�������Ա�����еĹ����Ͳ�����ϵ[Data: Entities (122), Relationships (681, 276, 1074)]��"
        },
        {
            "summary": "�����Ա�ľ��þ���",
            "explanation": "�������ھ��þ����ϱ��ֳ��������������ѵ��Ϻ��󣬲�ԸΪ���ӹ�Ӷ��ĸ�����־��þ��߷�ӳ�˼������ض������µ���Ӧ�ԺͶ���Դ�ĺ������á�������Ϊ�������ģ�������Ա�����еĹ����Ͳ�����ϵ[Data: Entities (282), Relationships (449, 540, 454, 1071, 1063, 1076, 1074, 1077, 1078)]��"
        }
    ]
}

Traceback (most recent call last):
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
10:54:04,391 graphrag.index.graph.extractors.community_reports.community_reports_extractor ERROR error generating community report
Traceback (most recent call last):
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\index\graph\extractors\community_reports\community_reports_extractor.py", line 58, in __call__
    await self._llm(
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\json_parsing_llm.py", line 34, in __call__
    result = await self._delegate(input, **kwargs)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\openai_token_replacing_llm.py", line 37, in __call__
    return await self._delegate(input, **kwargs)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\openai_history_tracking_llm.py", line 33, in __call__
    output = await self._delegate(input, **kwargs)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\base\caching_llm.py", line 104, in __call__
    result = await self._delegate(input, **kwargs)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\base\rate_limiting_llm.py", line 177, in __call__
    result, start = await execute_with_retry()
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\base\rate_limiting_llm.py", line 159, in execute_with_retry
    async for attempt in retryer:
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\tenacity\asyncio\__init__.py", line 166, in __anext__
    do = await self.iter(retry_state=self._retry_state)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\tenacity\asyncio\__init__.py", line 153, in iter
    result = await action(retry_state)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\tenacity\_utils.py", line 99, in inner
    return call(*args, **kwargs)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\tenacity\__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\base\rate_limiting_llm.py", line 165, in execute_with_retry
    return await do_attempt(), start
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\base\rate_limiting_llm.py", line 147, in do_attempt
    return await self._delegate(input, **kwargs)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\base\base_llm.py", line 48, in __call__
    return await self._invoke_json(input, **kwargs)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\openai_chat_llm.py", line 82, in _invoke_json
    result = await generate()
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\openai_chat_llm.py", line 74, in generate
    await self._native_json(input, **{**kwargs, "name": call_name})
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\openai_chat_llm.py", line 108, in _native_json
    json_output = try_parse_json_object(raw_output)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\site-packages\graphrag\llm\openai\utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\10400\miniconda3\envs\graphrag_test\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
10:54:04,394 graphrag.index.reporting.file_workflow_callbacks INFO Community Report Extraction Error details=None
10:54:04,394 graphrag.index.verbs.graph.report.strategies.graph_intelligence.run_graph_intelligence WARNING No report found for community: 71
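
A note on the traceback above: `json.loads` fails at `line 1 column 1 (char 0)`, and the logged payload begins with a `json` Markdown code fence, so the parser may be rejecting the fence itself rather than the Chinese text. The following is a minimal sketch of stripping such a fence before parsing. `strip_code_fence` and `parse_llm_json` are hypothetical helpers, not graphrag functions, and the sample payload is invented for illustration.

```python
import json


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence, if present.

    Hypothetical helper (not part of graphrag). Assumes the opening
    fence marker sits on its own line, as in the logged payload.
    """
    stripped = text.strip()
    if stripped.startswith("```"):
        # Drop the whole opening fence line, e.g. "```json".
        _, _, stripped = stripped.partition("\n")
        stripped = stripped.rstrip()
        if stripped.endswith("```"):
            # Drop the closing fence.
            stripped = stripped[:-3]
    return stripped.strip()


def parse_llm_json(text: str) -> dict:
    """Parse a model response that may or may not be fenced."""
    return json.loads(strip_code_fence(text))


# Invented sample payload shaped like the logged response.
raw = '```json\n{"title": "围城", "rating": 6.5}\n```'

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    # The backtick at position 0 is not a valid start of a JSON value.
    print("unfenced parse failed at position", err.pos)

report = parse_llm_json(raw)
print(report["title"], report["rating"])
```

If the fence is the cause, this would also explain why the failure is independent of the language of the input text.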

zhouxihong1 commented 1 week ago

I also tried Chinese text, and indexing completed normally with UTF-8 input. Entities, relationships, and the graph were all generated correctly. However, the Chinese characters in the intermediate output are shown in escaped Unicode form. I hope this can be changed to display the actual characters, since it looks like a character encoding problem.
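
For context, this is how Python's standard `json` module behaves by default: non-ASCII characters are written as `\uXXXX` escapes unless `ensure_ascii=False` is passed. Whether graphrag writes its artifacts through `json.dumps` is an assumption here, so treat this as an illustration of the symptom, not a description of the library's internals.

```python
import json

# Invented sample record for illustration.
record = {"title": "围城"}

# Default: non-ASCII characters are escaped, which is still valid JSON.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode to the same data, so the escaping is cosmetic.
print(json.loads(escaped) == json.loads(readable))
```

When writing the readable form to a file, open the file with `encoding="utf-8"` so the result does not depend on the platform's default encoding.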

KylinMountain commented 1 week ago

Yes, I was able to index a Chinese web novel. You can refer to the write-up on my WeChat official account (weixin gongzhonghao).

I used the web novel 仙逆 (Xian Ni), and it worked.

xxWeiDG commented 1 week ago

> Yes, I was able to index a Chinese web novel. You can refer to the write-up on my WeChat official account (weixin gongzhonghao).
>
> I used the web novel 仙逆 (Xian Ni), and it worked.

Do you know how to fix this error? [image]

sipie800 commented 1 week ago

Same here. The error appears randomly.

Lincolnwill commented 2 days ago

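> Do you know how to fix this error?

Check your logs. The cause may be an "Error Invoking LLM" failure in the large model call that leads to a ReadTimeout, which ultimately surfaces as KeyError: 'community'.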
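
One way to follow this advice is to count how often each suspected failure marker occurs in the log. This is a rough sketch: the marker strings are taken from this thread, while `count_markers`, `scan_log`, and the idea of passing a log path are hypothetical and not part of graphrag.

```python
from collections import Counter
from pathlib import Path

# Marker strings mentioned in this thread.
MARKERS = (
    "Error Invoking LLM",
    "ReadTimeout",
    "KeyError: 'community'",
    "JSONDecodeError",
)


def count_markers(lines) -> Counter:
    """Count how many lines contain each marker."""
    counts = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def scan_log(path) -> Counter:
    """Scan a log file; the path is supplied by the caller (hypothetical)."""
    # errors="replace" keeps the scan going if the log has mixed encodings.
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return count_markers(handle)


# Two lines copied from the log earlier in this thread.
sample = [
    "10:54:04,391 graphrag.llm.openai.utils ERROR error loading json, json=```json",
    "json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)",
]
print(count_markers(sample))
```

A log dominated by `ReadTimeout` points at the model endpoint, while one dominated by `JSONDecodeError` points at the response format.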