Closed minglong-huang closed 3 months ago
same issue
same issue. I solved it, see #603. I'm getting an error with the create_final_community_reports step on Chinese text.
Routing to #657
same issue. I solved it, see #603. I'm getting an error with the create_final_community_reports step on Chinese text. How do you fix this issue?
graphrag/llm/openai/utils.py code has been modified as follows:
````python
def try_parse_json_object(input: str) -> dict:
    """Generate JSON-string output using best-attempt prompting & parsing techniques."""
    try:
        clean_json = clean_up_json(input)
        result = json.loads(clean_json)
    except json.JSONDecodeError:
        log.exception("error loading json, json=%s", input)
        raise
    else:
        if not isinstance(result, dict):
            raise TypeError
        return result


def clean_up_json(json_str: str) -> str:
    """Clean up json string."""
    json_str = (
        json_str.replace("\\n", "")
        .replace("\n", "")
        .replace("\r", "")
        .replace('"[{', "[{")
        .replace('}]"', "}]")
        .replace("\\", "")
        # Refer: graphrag/llm/openai/_json.py, graphrag/index/utils/json.py
        .replace("{{", "{")
        .replace("}}", "}")
        .strip()
    )
    # Remove JSON Markdown Frame
    if json_str.startswith("```json"):
        json_str = json_str[len("```json"):]
    if json_str.endswith("```"):
        json_str = json_str[: len(json_str) - len("```")]
    return json_str
````
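As a quick sanity check, the fence-stripping step described above can be exercised in isolation. This is a standalone sketch, not GraphRAG code: `FENCE` and `strip_markdown_fence` are illustrative names, and the backtick constant is built programmatically only so the snippet can live inside a markdown comment.

```python
import json

FENCE = "`" * 3  # literal ``` built programmatically to avoid clashing with this snippet's own fence

def strip_markdown_fence(s: str) -> str:
    # Mirrors the "Remove JSON Markdown Frame" step: drop a leading ```json
    # and a trailing ``` so the remaining text is plain JSON.
    s = s.strip()
    if s.startswith(FENCE + "json"):
        s = s[len(FENCE + "json"):]
    if s.endswith(FENCE):
        s = s[: -len(FENCE)]
    return s.strip()

# A typical LLM response that wraps its JSON in a markdown frame:
raw = FENCE + 'json\n{"title": "report", "rating": 7.5}\n' + FENCE
result = json.loads(strip_markdown_fence(raw))
assert result == {"title": "report", "rating": 7.5}
```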
After changing the code, did you have to rebuild the index? Then it works?
Of course, after changing the code you have to rebuild the index. Then it works. I use GLM-4 and Xinference.
Regarding embeddings: do you know which local embedding model is the best?
Is there an existing issue for this?
Describe the issue
I just finished setting up RAG and am trying out both Local Search and Global Search, but neither method has been successful.
1. Local Search: in graphrag\query\structured_search\local_search\search.py, I found through debugging that after

```python
response = self.llm.generate(
    messages=search_messages,
    streaming=True,
    callbacks=self.callbacks,
    **self.llm_params,
)
```

response is an empty string (response=''), even though search_messages holds the correct content.

2. Global Search: in graphrag\query\structured_search\global_search\search.py, after

```python
async with self.semaphore:
    search_response = await self.llm.agenerate(
        messages=search_messages, streaming=False, **llm_kwargs
    )
```

search_response = '' and then: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
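That traceback is exactly what `json.loads` raises when handed an empty string, which is consistent with the LLM endpoint returning no content. A minimal standalone reproduction (not GraphRAG code; the guard at the end is only a defensive sketch, the real fix is getting the endpoint to return content):

```python
import json

search_response = ""  # what the LLM call returned in this report

try:
    json.loads(search_response)
    error_message = ""
except json.JSONDecodeError as exc:
    # Reproduces: Expecting value: line 1 column 1 (char 0)
    error_message = str(exc)

# A defensive guard before parsing avoids the crash on empty responses:
parsed = json.loads(search_response) if search_response.strip() else {}
assert parsed == {}
```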
Steps to reproduce
No response
GraphRAG Config Used
```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: glm-4
  type: openai_chat # or azure_openai_chat
  model: glm-4-9b-chat
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 4000
  request_timeout: 180.0
  api_base: http://0.0.0.0:8081/v1
  api_version: 2024-02-15-preview
  organization:
  deployment_name:
  tokens_per_minute: 150_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 25 # the number of parallel inflight requests that may be made
  temperature: 0 # temperature for sampling
  top_p: 0.8 # top-p sampling
  n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: xinference
    type: openai_embedding # or azure_openai_embedding
    model: bce-embedding-base_v1
    api_base: http://192.0.0.181:9997/v1
    api_version: 2024-02-15-preview
    organization:
    deployment_name:
    tokens_per_minute: 150_000 # set a leaky bucket throttle
    requests_per_minute: 10_000 # set a leaky bucket throttle
    max_retries: 10
    max_retry_wait: 10.0
    sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    concurrent_requests: 25 # the number of parallel inflight requests that may be made
    batch_size: 16 # the number of documents to send in a single request
    batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  connection_string:
  container_name:

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  connection_string:
  container_name:

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  connection_string:
  container_name:

entity_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 6000
  data_max_tokens: 6000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
```
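One thing worth double-checking after copy-pasting a config like the one above: stray spaces inside `api_base` (the paste rendered it as `http: // 0.0 .0 .0: 8081 / v1`) make the URL invalid and the OpenAI client unable to reach the endpoint. A stdlib-only sanity check (`looks_like_valid_api_base` is an illustrative helper, not part of GraphRAG):

```python
from urllib.parse import urlparse

def looks_like_valid_api_base(url: str) -> bool:
    # Reject URLs containing whitespace, and require both a scheme and a host.
    if any(ch.isspace() for ch in url):
        return False
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

assert looks_like_valid_api_base("http://0.0.0.0:8081/v1")
assert not looks_like_valid_api_base("http: // 0.0 .0 .0: 8081 / v1")
```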
Logs and screenshots
1. local (screenshot) 2. global (screenshot)
Additional Information