Closed. YuLGitHub closed this issue 2 months ago.
runtime logs: create_base_text_units
     id                                ... n_tokens
0    e29a48285077a5769ed7d90bf591217d  ...      414
1    1b6feb425657be7ab88ee1f004cd45f1  ...     1200
2    afcff7258f50a602920f9f6c5e1bc368  ...     1200
3    b072c6a4d2b52e92fb1ffb3e4ba57e46  ...     1200
4    7fe7314eee0ee1f88611ddb60576c9e3  ...     1200
..   ...                               ...      ...
240  57329e2c9da52a470e774cdb7f186a8a  ...     1200
241  c70f56c4b97b394c21ddc1490f43e742  ...     1200
242  059269c95dca42327ae48e8e3c057833  ...     1200
243  a27a6cdad961f684c20908b1fd1bc710  ...     1200
244  bad253442095d8d366221aaeb795700e  ...      859
[9095 rows x 5 columns]
Received signal 1, exiting...
⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1200 files loaded (0 filtered) 100% …
├── create_base_text_units
└── create_base_extracted_entities
└── Verb entity_extract ━━━━ 15% 10:58:12 2:16:43
All tasks cancelled. Exiting...
Traceback (most recent call last):
File "
Routing to #657 - there may be other users who have found specific parameters and config to tune your particular model deployment.
Is there an existing issue for this?
- [x] I have searched the existing issues
- [x] I have checked #657 to validate if my issue is covered by community support
Describe the issue
We tried to index about 1,200 complete legal and regulatory documents and found that, after running for a while, all tasks were automatically cancelled once roughly 15% of the entities had been extracted.
Steps to reproduce
This problem is triggered when I use a large amount of input data; you can try reproducing it with a similarly large input set.
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: qwen2-7b-instruct:q4_0
  model_supports_json: true
  api_base: http://127.0.0.1:11434/v1
  concurrent_requests: 15
  top_p: 0.9 # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: xinference
    type: openai_embedding # or azure_openai_embedding
    model: bge-large-zh-v1.5
    api_base: http://127.0.0.1:9997/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
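Since both endpoints in this config are local OpenAI-compatible servers (Ollama for chat, Xinference for embeddings), one quick check before a long indexing run is to call them directly with the `openai` Python client and confirm the configured model names respond. This is only a hypothetical sanity-check script I am sketching here, not part of GraphRAG; the base URLs, API keys, and model names are taken from the config above.

```python
# Hypothetical standalone sanity check for the two endpoints in settings.yaml.
# Requires the `openai` package (v1.x); not part of GraphRAG itself.
from openai import OpenAI

# Chat endpoint served by Ollama (the api_key is ignored but must be non-empty).
chat = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="ollama")
reply = chat.chat.completions.create(
    model="qwen2-7b-instruct:q4_0",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(reply.choices[0].message.content)

# Embedding endpoint served by Xinference.
emb = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="xinference")
vec = emb.embeddings.create(model="bge-large-zh-v1.5", input=["测试文本"])
print(len(vec.data[0].embedding))
```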
Logs and screenshots
17:51:36,610 httpx INFO HTTP Request: POST http://127.0.0.1:11434/v1/chat/completions "HTTP/1.1 200 OK"
17:51:36,611 root ERROR error extracting graph
Traceback (most recent call last):
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 123, in __call__
    result = await self._process_document(text, prompt_variables)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 151, in _process_document
    response = await self._llm(
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/json_parsing_llm.py", line 34, in __call__
    result = await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in __call__
    return await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in __call__
    output = await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/base/caching_llm.py", line 96, in __call__
    result = await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 180, in __call__
    output_tokens = self.count_response_tokens(result.output)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 102, in count_response_tokens
    return self._count_tokens(output)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/utils.py", line 44, in <lambda>
    return lambda s: len(enc.encode(s))
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/tiktoken/core.py", line 117, in encode
    raise_disallowed_special_token(match.group())
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/tiktoken/core.py", line 400, in raise_disallowed_special_token
    raise ValueError(
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to allowed_special, e.g. allowed_special={'<|endoftext|>', ...}.
If you want this text to be encoded as normal text, disable the check for this token by passing disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'}).
To disable this check for all special tokens, pass disallowed_special=().

17:51:36,612 graphrag.index.reporting.file_workflow_callbacks INFO Entity Extraction Error details={'doc_index': 0, 'text': '依法追究法律责任。\n第六章 附则\n第四十八条 \n中国证监会及其派出机构办理诚信信息查询,除可以收取打印、复制、装订、邮寄成本费用外,不得收取其他费用。\n第四十九条 \n证券期货市场行业组织在履行自律管理职责中,查询诚信档案,实施诚信约束、激励的,参照本办法有关规定执行。\n第五十条 \n本办法自2018年7月1日起施行。《证券期货市场诚信监督管理暂行办法》(证监会令第106号)同时废止。'}
17:51:38,159 httpx INFO HTTP Request: POST http://127.0.0.1:11434/v1/chat/completions "HTTP/1.1 200 OK"
17:51:38,160 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "Process" with 0 retries took 1.5479999999515712. input_tokens=2935, output_tokens=49
17:51:39,809 httpx INFO HTTP Request: POST http://127.0.0.1:11434/v1/chat/completions "HTTP/1.1 200 OK"
17:51:39,810 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "extract-continuation-0" with 0 retries took 4.1019999999552965. input_tokens=34, output_tokens=106
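What the traceback shows: the qwen2 completion apparently contained the literal marker `<|endoftext|>`, and GraphRAG counts response tokens with tiktoken's cl100k_base encoder, which refuses to encode special tokens by default. A minimal standalone reproduction of that behaviour, together with the two options the error message itself suggests, is below; this is a sketch of the tiktoken API, not a GraphRAG patch, and the sample `model_output` string is made up for illustration.

```python
# Minimal reproduction of the ValueError above, outside of GraphRAG.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
model_output = "extracted entities ...<|endoftext|>"  # hypothetical completion text

try:
    enc.encode(model_output)  # default: special tokens are disallowed -> ValueError
except ValueError as err:
    print(err)

# Option 1: encode the marker as the special token it names.
print(len(enc.encode(model_output, allowed_special={"<|endoftext|>"})))

# Option 2: treat the marker as ordinary text and skip the check entirely.
print(len(enc.encode(model_output, disallowed_special=())))
```

Both options come straight from the error message; whether either is appropriate here, or whether the marker should be stripped from the model output before counting, is the kind of per-deployment tuning question routed to #657 above.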
Additional Information
- GraphRAG Version: v0.2.2
- Operating System: Linux
- Python Version: 3.11
- Related Issues: None
Hello. Did you solve this? I encountered the same problem.