microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Issue]: More input data (1200 files, 27MB): Verb entity_extract All tasks cancelled. #894

Closed - YuLGitHub closed this issue 2 months ago

YuLGitHub commented 3 months ago

Is there an existing issue for this?

Describe the issue

We tried to index 1,200 complete legal and regulatory documents and found that, after running for a period of time, all tasks were automatically cancelled once about 15% of the entities had been extracted. (screenshot: tasks_cancelled)

Steps to reproduce

This problem is triggered when a large amount of input data is used. You can try reproducing it with a similarly large corpus.

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: qwen2-7b-instruct:q4_0
  model_supports_json: true 
  api_base: http://127.0.0.1:11434/v1
  concurrent_requests: 15
  top_p: 0.9
  # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: xinference
    type: openai_embedding # or azure_openai_embedding
    model: bge-large-zh-v1.5
    api_base: http://127.0.0.1:9997/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
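
Before a long run it can help to confirm that both local OpenAI-compatible endpoints referenced in this config actually answer. The following is only a sketch, assuming the openai 1.x Python client is installed; the ports, model names, and placeholder API keys are taken directly from the settings above.

from openai import OpenAI

# Chat endpoint served by ollama (api_base, model, and placeholder key from the config above)
chat = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="ollama")
resp = chat.chat.completions.create(
    model="qwen2-7b-instruct:q4_0",
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(resp.choices[0].message.content)

# Embedding endpoint served by xinference (settings from the embeddings block above)
embed = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="xinference")
emb = embed.embeddings.create(model="bge-large-zh-v1.5", input=["测试文本"])
print(len(emb.data[0].embedding))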

Logs and screenshots

17:51:36,610 httpx INFO HTTP Request: POST http://127.0.0.1:11434/v1/chat/completions "HTTP/1.1 200 OK"
17:51:36,611 root ERROR error extracting graph
Traceback (most recent call last):
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 123, in __call__
    result = await self._process_document(text, prompt_variables)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 151, in _process_document
    response = await self._llm(
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/json_parsing_llm.py", line 34, in __call__
    result = await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in __call__
    return await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in __call__
    output = await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/base/caching_llm.py", line 96, in __call__
    result = await self._delegate(input, **kwargs)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 180, in __call__
    output_tokens = self.count_response_tokens(result.output)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 102, in count_response_tokens
    return self._count_tokens(output)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/utils.py", line 44, in <lambda>
    return lambda s: len(enc.encode(s))
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/tiktoken/core.py", line 117, in encode
    raise_disallowed_special_token(match.group())
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/tiktoken/core.py", line 400, in raise_disallowed_special_token
    raise ValueError(
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'. If you want this text to be encoded as a special token, pass it to allowed_special, e.g. allowed_special={'<|endoftext|>', ...}. If you want this text to be encoded as normal text, disable the check for this token by passing disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'}). To disable this check for all special tokens, pass disallowed_special=().
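
The failure happens in the token-counting step: by default tiktoken refuses to encode text that contains a special token such as '<|endoftext|>', and here the string being counted is the model's own output. A minimal sketch of the behaviour, and of the per-call workaround the error message itself suggests; only the tiktoken package is assumed, and the sample string stands in for a completion that happens to contain the marker.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Stand-in for a completion from the local model that contains the literal marker.
bad_output = "some completion text <|endoftext|>"

try:
    enc.encode(bad_output)  # default: all special tokens are disallowed -> ValueError
except ValueError as err:
    print(f"token counting failed: {err}")

# The same text encodes fine once the special-token check is disabled for this call.
print(len(enc.encode(bad_output, disallowed_special=())))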

17:51:36,612 graphrag.index.reporting.file_workflow_callbacks INFO Entity Extraction Error details={'doc_index': 0, 'text': '依法追究法律责任。\n第六章 附则\n第四十八条 \n中国证监会及其派出机构办理诚信信息查询,除可以收取打印、复制、装订、邮寄成本费用外,不得收取其他费用。\n第四十九条 \n证券期货市场行业组织在履行自律管理职责中,查询诚信档案,实施诚信约束、激励的,参照本办法有关规定执行。\n第五十条 \n本办法自2018年7月1日起施行。《证券期货市场诚信监督管理暂行办法》(证监会令第106号)同时废止。'}
17:51:38,159 httpx INFO HTTP Request: POST http://127.0.0.1:11434/v1/chat/completions "HTTP/1.1 200 OK"
17:51:38,160 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "Process" with 0 retries took 1.5479999999515712. input_tokens=2935, output_tokens=49
17:51:39,809 httpx INFO HTTP Request: POST http://127.0.0.1:11434/v1/chat/completions "HTTP/1.1 200 OK"
17:51:39,810 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "extract-continuation-0" with 0 retries took 4.1019999999552965. input_tokens=34, output_tokens=106
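
One thing worth ruling out on the user side is the input data itself: if any of the 1,200 source files already contains the literal '<|endoftext|>' marker, it will eventually reach the tokenizer. A hypothetical pre-flight check, assuming only the input layout from the config above (input/*.txt, UTF-8):

from pathlib import Path

MARKER = "<|endoftext|>"

# Report every input file that contains the literal special-token marker so it can be
# stripped before indexing; if none do, the marker is coming from the model's own
# completions rather than from the source documents.
for path in sorted(Path("input").glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    if MARKER in text:
        print(f"{path}: {text.count(MARKER)} occurrence(s) of {MARKER!r}")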

Additional Information

YuLGitHub commented 3 months ago

runtime logs:

create_base_text_units
                                   id  ...  n_tokens
0    e29a48285077a5769ed7d90bf591217d  ...       414
1    1b6feb425657be7ab88ee1f004cd45f1  ...      1200
2    afcff7258f50a602920f9f6c5e1bc368  ...      1200
3    b072c6a4d2b52e92fb1ffb3e4ba57e46  ...      1200
4    7fe7314eee0ee1f88611ddb60576c9e3  ...      1200
..                                ...  ...       ...
240  57329e2c9da52a470e774cdb7f186a8a  ...      1200
241  c70f56c4b97b394c21ddc1490f43e742  ...      1200
242  059269c95dca42327ae48e8e3c057833  ...      1200
243  a27a6cdad961f684c20908b1fd1bc710  ...      1200
244  bad253442095d8d366221aaeb795700e  ...       859

[9095 rows x 5 columns]
Received signal 1, exiting...
⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1200 files loaded (0 filtered) 100% …
├── create_base_text_units
└── create_base_extracted_entities
    └── Verb entity_extract ━━━━ 15% 10:58:12 2:16:43
All tasks cancelled. Exiting...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/__main__.py", line 76, in <module>
    index_cli(
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/cli.py", line 161, in index_cli
    _run_workflow_async()
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/cli.py", line 154, in _run_workflow_async
    runner.run(execute())
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/cli.py", line 123, in execute
    async for output in run_pipeline_with_config(
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/run.py", line 154, in run_pipeline_with_config
    async for table in run_pipeline(
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
    result = await result
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/index/verbs/entities/extraction/entity_extract.py", line 161, in entity_extract
    results = await derive_from_rows(
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/datashaper/execution/derive_from_rows.py", line 33, in derive_from_rows
    return await derive_from_rows_asyncio_threads(
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/datashaper/execution/derive_from_rows_asyncio_threads.py", line 40, in derive_from_rows_asyncio_threads
    return await derive_from_rows_base(input, transform, callbacks, gather)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/datashaper/execution/derive_from_rows_base.py", line 49, in derive_from_rows_base
    result = await gather(execute)
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/datashaper/execution/derive_from_rows_asyncio_threads.py", line 38, in gather
    return await asyncio.gather(*[execute_task(task) for task in tasks])
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/site-packages/datashaper/execution/derive_from_rows_asyncio_threads.py", line 33, in execute_task
    async with semaphore:
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/asyncio/locks.py", line 15, in __aenter__
    await self.acquire()
  File "/data/kimxzhang/anaconda3/envs/graphrag/lib/python3.11/asyncio/locks.py", line 387, in acquire
    await fut
asyncio.exceptions.CancelledError
sys:1: RuntimeWarning: coroutine 'to_thread' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

natoverse commented 2 months ago

Routing to #657 - there may be other users who have found specific parameters and config to tune your particular model deployment.

worstkid92 commented 1 month ago

Is there an existing issue for this?

  • [x] I have searched the existing issues
  • [x] I have checked #657 to validate if my issue is covered by community support

Additional Information

  • GraphRAG Version: v0.2.2
  • Operating System: Linux
  • Python Version: 3.11
  • Related Issues: None

Hello. Did you solve this? I encountered the same problem.