microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: json.decoder.JSONDecodeError when generating Community reports #604

Closed Amitabh-Priyadarshi-Bayer closed 1 month ago

Amitabh-Priyadarshi-Bayer commented 1 month ago

Describe the bug

The generated JSON starts with doubled braces, e.g. json={{ "title": "Product Team: Mansz and Jrman", which causes a parse error. I tried to fix the system message for the community report, but the error persists, and when I looked into the report it shows that the community_report prompt is null.

In settings.yaml, the prompt filename for the community report is "prompts/community_report.txt". I changed the double braces '{{' to single '{' in community_report.txt, but the JSON is still created with double '{{'.

community_report:
  prompt: "prompts/community_report.txt"
  max_length: 4000
  max_input_length: 12000
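
One possible explanation for the doubled braces, offered as an assumption rather than something confirmed in this thread: {{ and }} are escape sequences for Python's str.format, so a template written for str.format reaches the model with the doubled braces intact if it is filled in by plain string replacement instead. The sketch below uses a hypothetical template fragment and hypothetical helper names to show the difference between the two substitution styles.

```python
# Hypothetical template fragment; the real prompt is much longer.
TEMPLATE = 'Return JSON like {{"title": <report_title>}} for: {input_text}'


def fill_with_format(template, **values):
    # str.format treats {{ and }} as escapes and emits single braces.
    return template.format(**values)


def fill_with_replace(template, **values):
    # Plain substitution only touches {name} tokens and leaves {{ and }} as-is.
    for name, value in values.items():
        template = template.replace("{" + name + "}", str(value))
    return template


formatted = fill_with_format(TEMPLATE, input_text="some text")
replaced = fill_with_replace(TEMPLATE, input_text="some text")
print(formatted)
print(replaced)
```

If the second style is what runs, a model shown {{ in its instructions may simply echo it back, which would match the logged output.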

Also, the "community_reports" section of indexing-engine.log shows "prompt": null instead of the filename 'prompts/community_report.txt' that is set in settings.yaml:

"community_reports": 
      "async_mode": "threaded",
      "prompt": null,
      "max_length": 2000,
      "max_input_length": 8000,
      "strategy": null
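
Comparing the two excerpts above, the settings.yaml section is named community_report while the logged configuration is keyed community_reports and carries different values (max_length 2000 rather than 4000). That could mean the section name in the file is not the one the loader reads; this is an inference from the excerpts, not something the thread confirms. A minimal sketch of how such a mismatch could be spotted, using a hypothetical helper and only the section names visible in this issue:

```python
import difflib


def find_unrecognized_sections(user_keys, resolved_keys):
    """Map each user key missing from the resolved config to its closest match."""
    suggestions = {}
    for key in user_keys:
        if key not in resolved_keys:
            suggestions[key] = difflib.get_close_matches(key, resolved_keys, n=1)
    return suggestions


# Section names as they appear in the settings.yaml excerpt.
user_keys = ["entity_extraction", "summarize_descriptions",
             "claim_extraction", "community_report"]
# Section names as they appear in the logged configuration.
resolved_keys = ["entity_extraction", "summarize_descriptions",
                 "community_reports", "claim_extraction"]

mismatches = find_unrecognized_sections(user_keys, resolved_keys)
print(mismatches)
```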

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat # or azure_openai_chat
  model: gpt-4-32k (0613)
  model_supports_json: false
  max_tokens: 4000
  request_timeout: 180.0
  api_base: -removed because of security purpose
  api_version: '2023-05-15'
  organization:
  deployment_name: gpt-4-32k
  tokens_per_minute: 150_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding
    model: text-embedding-ada-002
    api_base: removed because of security purpose
    api_version: '2023-05-15'
    organization:
    deployment_name: embedding
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  connection_string:
  container_name:

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  connection_string:
  container_name:

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  connection_string:
  container_name:

entity_extraction:
  llm: override the global llm settings for this task
  parallelization: override the global parallelization settings for this task
  async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  llm: override the global llm settings for this task
  parallelization: override the global parallelization settings for this task
  async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  llm: override the global llm settings for this task
  parallelization: override the global parallelization settings for this task
  async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  llm: override the global llm settings for this task
  parallelization: override the global parallelization settings for this task
  async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 4000
  max_input_length: 12000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32

Logs and screenshots

20:19:16,982 graphrag.config.read_dotenv INFO Loading pipeline .env file 20:19:16,988 graphrag.index.cli INFO using default configuration: { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "root_dir": "GraphRAG/", "reporting": { "type": "file", "base_dir": "output/${timestamp}/reports", "storage_account_blob_url": null }, "storage": { "type": "file", "base_dir": "output/${timestamp}/artifacts", "storage_account_blob_url": null }, "cache": { "type": "file", "base_dir": "cache", "storage_account_blob_url": null }, "input": { "type": "file", "file_type": "text", "base_dir": "input", "storage_account_blob_url": null, "encoding": "utf-8", "file_pattern": ".*\.txt$", "file_filter": null, "source_column": null, "timestamp_column": null, "timestamp_format": null, "text_column": "text", "title_column": null, "document_attribute_columns": [] }, "embed_graph": { "enabled": false, "num_walks": 10, "walk_length": 40, "window_size": 2, "iterations": 3, "random_seed": 597832, "strategy": null }, "embeddings": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_embedding", "model": "text-embedding-ada-002", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "embedding", "model_supports_json": null, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, 
"sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "batch_size": 16, "batch_max_tokens": 8191, "target": "required", "skip": [], "vector_store": null, "strategy": null }, "chunks": { "size": 300, "overlap": 100, "group_by_columns": [ "id" ], "strategy": null }, "snapshots": { "graphml": false, "raw_entities": false, "top_level_nodes": false }, "entity_extraction": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/entity_extraction.txt", "entity_types": [ "organization", "person", "geo", "event" ], "max_gleanings": 0, "strategy": null }, "summarize_descriptions": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/summarize_descriptions.txt", "max_length": 500, "strategy": null }, "community_reports": { "llm": { "api_key": "REDACTED, length 32", "type": 
"azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": null, "max_length": 2000, "max_input_length": 8000, "strategy": null }, "claim_extraction": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "enabled": true, "prompt": "prompts/claim_extraction.txt", "description": "Any claims or facts that could be relevant to information discovery.", "max_gleanings": 0, "strategy": null }, "cluster_graph": { "max_cluster_size": 10, "strategy": null }, "umap": { "enabled": false }, "local_search": { "text_unit_prop": 0.5, "community_prop": 0.1, "conversation_history_max_turns": 5, "top_k_entities": 10, "top_k_relationships": 10, "max_tokens": 12000, "llm_max_tokens": 2000 }, "global_search": { "max_tokens": 12000, "data_max_tokens": 12000, "map_max_tokens": 1000, "reduce_max_tokens": 2000, "concurrency": 32 }, "encoding_model": "cl100k_base", "skip_workflows": [] }

20:20:39,273 graphrag.index.reporting.file_workflow_callbacks INFO Community Report Extraction Error details=None
20:20:39,273 graphrag.index.verbs.graph.report.strategies.graph_intelligence.run_graph_intelligence WARNING No report found for community: 0
20:20:39,346 httpx INFO HTTP Request: POST --"
20:20:39,347 graphrag.llm.openai.utils ERROR error loading json, json={{ "title": "Application Support Team and Controlled Environment", "summary": "The community revolves around the Application Support Team, which provides assistance to users experiencing problems with the application. The team interacts with various features of the application, including the Controlled Environment, Admin Tab, In-app Support Ticket System, Statuses File, and Summary View.", "rating": 7.0, "rating_explanation": "The impact severity rating is high due to the critical role of the Application Support Team in ensuring smooth operation of the application.", "findings": [ {{ "summary": "Functionality of the Summary View", "explanation": "The Summary View is a customizable section of the application where users can adjust the display of information. The Application Support Team can provide assistance for customizing the Summary View, indicating its complexity and potential for user customization. [Data: Entities (26), Relationships (37)]" }} ]}}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/graphrag/llm/openai/utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
20:20:39,349 graphrag.llm.openai.openai_chat_llm WARNING error parsing llm json, retrying
20:20:39,978 httpx INFO HTTP Request: POST https://agvisorapimtest.azure-api.net/openapi-test/openai/deployments/gpt-4-32k/chat/completions?api-version=2023-05-15 "HTTP/1.1 200 OK"
20:20:39,980 graphrag.llm.openai.utils ERROR error loading json, json={output_text}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/graphrag/llm/openai/openai_chat_llm.py", line 124, in _manual_json
    json_output = try_parse_json_object(output)
  File "/opt/conda/lib/python3.10/site-packages/graphrag/llm/openai/utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:
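
The first failure in the log is reproducible with the standard library alone: after the opening brace, the JSON parser expects a quoted property name and finds a second brace instead. The sketch below reproduces it on a shortened, made-up payload and shows one possible tolerant fallback. The helper name is hypothetical and this is not GraphRAG's own parser; it only illustrates the idea under the assumption that every brace in the reply was doubled.

```python
import json


def parse_json_tolerating_doubled_braces(text):
    """Parse JSON, retrying with collapsed braces only if strict parsing fails."""
    try:
        # Valid JSON such as {"a": {"b": 1}} legitimately ends in }},
        # so always try the text unchanged first.
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: assume every brace was doubled, as in the log above.
        # Note this would also alter {{ or }} inside string values.
        collapsed = text.replace("{{", "{").replace("}}", "}")
        return json.loads(collapsed)


doubled = '{{ "title": "T", "findings": [ {{ "summary": "S" }} ]}}'

try:
    json.loads(doubled)
except json.JSONDecodeError as err:
    print(err)  # same class of error as in the log

report = parse_json_tolerating_doubled_braces(doubled)
print(report["title"])
```

Fixing the prompt so the model never emits doubled braces is the cleaner route; a fallback like this only hides the symptom.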

Additional Information

WeiminLee commented 1 month ago

I also encountered this error. I solved it by changing the default summary report prompt, replacing "{{" and "}}" with "{" and "}".

Amitabh-Priyadarshi-Bayer commented 1 month ago

@WeiminLee I did that, but GraphRAG is still fetching the old summary_report from inside the code instead of the locally changed summary_report.txt file that is referenced in my settings.yaml.

WeiminLee commented 1 month ago

It's a bug, so you should change the prompt inside the package (graphrag/index/graph/extractors/graph/prompts.py).

sparkfyb commented 1 month ago

Actually, it should probably be "graphrag/index/graph/extractors/community_reports/prompts.py". Anyway, thank you!

Amitabh-Priyadarshi-Bayer commented 1 month ago

I am using: python -m graphrag.index --init --root GraphRAG/ so GraphRAG is my local folder, which contains the prompts/community_report.txt file that you can edit to customize the system message.

I took your advice and changed the graphrag/index/graph/extractors/community_reports/prompts.py file on my system, located at /opt/conda/lib/python3.10/site-packages/graphrag/index/graph/extractors/community_reports.

Now it's working for me.
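
For anyone trying the same workaround in a different environment, the installed file's location can be looked up instead of guessed. This is a minimal sketch using the standard importlib machinery; the helper name is hypothetical, and the dotted module name in the comment is only inferred from the path quoted in this thread.

```python
import importlib.util


def locate_module_file(dotted_name):
    """Return the source file path of a module, or None if it cannot be found."""
    try:
        # For a dotted name this imports the parent packages to resolve the path.
        spec = importlib.util.find_spec(dotted_name)
    except ModuleNotFoundError:
        return None
    return None if spec is None else spec.origin


# Works for any importable module, for example one from the standard library:
print(locate_module_file("json.decoder"))

# Assumed module name, derived from the path mentioned above:
# locate_module_file("graphrag.index.graph.extractors.community_reports.prompts")
```

Edits made to a file under site-packages are overwritten when the package is reinstalled or upgraded, so this is a stopgap rather than a lasting fix.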

Archdiner commented 1 month ago

Is there any chance you could show the original prompt you had and what you then changed it to?

Amitabh-Priyadarshi-Bayer commented 1 month ago

@Archdiner In graphrag/index/graph/extractors/community_reports/prompts.py, I changed all occurrences of {{ to { and }} to }.

For example, I changed:

 {{
        "title": <report_title>,
        "summary": <executive_summary>,
        "rating": <impact_severity_rating>,
        "rating_explanation": <rating_explanation>,
        "findings": [
            {{
                "summary":<insight_1_summary>,
                "explanation": <insight_1_explanation>
            }},
            {{
                "summary":<insight_2_summary>,
                "explanation": <insight_2_explanation>
            }}
        ]
    }}

to the following:

{
        "title": <report_title>,
        "summary": <executive_summary>,
        "rating": <impact_severity_rating>,
        "rating_explanation": <rating_explanation>,
        "findings": [
            {
                "summary":<insight_1_summary>,
                "explanation": <insight_1_explanation>
            },
            {
                "summary":<insight_2_summary>,
                "explanation": <insight_2_explanation>
            }
        ]
    }
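
The manual edit described above can also be done programmatically. The sketch below is one way to do it under stated assumptions: the helper name is made up, the sample text is a shortened stand-in for the real prompt, and {input_text} is a hypothetical placeholder name used only to show that single-brace tokens are left alone.

```python
def collapse_doubled_braces(prompt_text):
    """Return the prompt with {{ and }} collapsed, plus how many pairs were found."""
    doubled = prompt_text.count("{{") + prompt_text.count("}}")
    fixed = prompt_text.replace("{{", "{").replace("}}", "}")
    return fixed, doubled


sample = (
    '{{\n'
    '    "title": <report_title>,\n'
    '    "findings": [ {{ "summary": <insight_1_summary> }} ]\n'
    '}}\n'
    'Text: {input_text}\n'
)

fixed, doubled = collapse_doubled_braces(sample)
print(f"collapsed {doubled} doubled braces")
print(fixed)
```

Keeping a backup of the original file before overwriting it makes the change easy to revert.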

galen1980guo commented 1 month ago

13:56:11,978 datashaper.workflow.workflow INFO executing verb create_community_reports
13:56:33,992 httpx INFO HTTP Request: POST http://10.110.0.25:11434/v1/chat/completions "HTTP/1.1 200 OK"
13:56:33,998 graphrag.llm.openai.utils ERROR error loading json, json=Here is the output in JSON format:```{    "title": "Baidu Community",    "summary": "The Baidu community revolves around Robin Li, the founder and CEO of Baidu, a Chinese technology company focused on AI applications and research. The community's dynamics are shaped by Robin Li's concerns about AI development risks and his leadership role in Baidu.",    "rating": 6.0,    "rating_explanation": "The impact severity rating is moderate due to the potential influence of Robin Li's views on AI development and Baidu's prominent position in the tech industry.",    "findings": [        {            "summary": "Robin Li's leadership role in Baidu",            "explanation": "Robin Li is the founder, chairman, and CEO of Baidu, indicating his significant influence over the company's direction and decisions. This leadership role is crucial in understanding the community's dynamics [Data: Entities (21); Relationships (24)]."        },        {            "summary": "Baidu's focus on AI applications and research",            "explanation": "Baidu is a Chinese technology company focused on both AI applications and research, suggesting its significant contribution to the development of AI in China. This focus could have implications for the community's dynamics [Data: Entities (22)]."        },        {            "summary": "Robin Li's concerns about AI development risks",            "explanation": "Robin Li is concerned about the risks associated with AI development, which could impact society. This concern suggests that he may be cautious in his approach to AI development and deployment [Data: Relationships (5)]."        }    ]}```Let me know if you need any further assistance!
Traceback (most recent call last):
  File "/data/galen_guo/workspace/LLM-Research/graphrag/graphrag/llm/openai/utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
             ^^^^^^^^^^^^^^^^^
  File "/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/galen.guo/miniforge3/envs/rag_env/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
13:56:33,999 graphrag.llm.openai.openai_chat_llm WARNING error parsing llm json, retrying

...
I am encountering the same issue. I modified prompts/community_report.txt and graphrag/index/graph/extractors/community_reports/prompts.py, but the changes did not take effect. I ran the indexing with the command "poetry run poe index --root ." in the repository. Is this related to the problem discussed earlier?
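
The log in this comment fails differently from the original report: the JSON itself uses single braces, but the model wrapped it in a sentence and code fences, so parsing stops at the very first character ("Expecting value: line 1 column 1 (char 0)"). A minimal sketch of one way to pull the object out of such a reply with the standard library follows. The helper name is hypothetical and the reply is a shortened stand-in for the logged one; this is not how GraphRAG itself handles it.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in text, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in text")


fence = "`" * 3  # the code-fence marker the model wrapped around its output
reply = (
    "Here is the output in JSON format:" + fence
    + '{"title": "Baidu Community", "rating": 6.0}'
    + fence + "Let me know if you need any further assistance!"
)

report = extract_first_json_object(reply)
print(report["title"])
```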

galen1980guo commented 1 month ago

@WeiminLee @Archdiner @Amitabh-Priyadarshi-Bayer Could you please help confirm this? I would be very grateful!

ehrgeizig commented 3 weeks ago

@WeiminLee @Archdiner @Amitabh-Priyadarshi-Bayer I am getting this error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 327 (char 326). What should I do?
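
An "Expecting ',' delimiter" error reports a character offset, so a reasonable first step is to look at the text around that offset. One common cause is an unescaped double quote inside a string value, though nothing in this thread confirms that is what happened here. The helper below is a hypothetical diagnostic sketch, and the broken sample is made up.

```python
import json


def describe_json_error(text, width=20):
    """Return a one-line description of the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset reported in the error message.
        before = text[max(0, err.pos - width):err.pos]
        after = text[err.pos:err.pos + width]
        return f"{err.msg} at char {err.pos}: {before!r} >>> {after!r}"
    return None


# Made-up example: the inner quotes end the string value early.
broken = '{"summary": "the "Summary View" is customizable"}'
print(describe_json_error(broken))
```

Running the same helper on the raw model output from the log should show which part of the reply the parser stopped on.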