Closed: macfangyw closed this 3 weeks ago
BTW, the two docx files that failed to index were my own mistake; docx indexing works fine.
Hey there, @macfangyw! I'm here to help you with any bugs, questions, or contributions you have. Let me know how I can assist you!
I found a similar issue where PDF files cannot be indexed while DOCX files can in Dify version 0.7.0. The community version defaults to Dify ETL, which does support PDF files, although its extraction results may differ from those of Unstructured ETL. The SaaS version defaults to Unstructured ETL, which supports a broader range of file formats, including PDF.
To enable PDF indexing, you may need to switch to Unstructured ETL by configuring the appropriate environment variables:
ETL_TYPE: set to Unstructured
UNSTRUCTURED_API_URL: set to the Unstructured API endpoint
Example configuration:
ETL_TYPE=Unstructured
UNSTRUCTURED_API_URL=http://unstructured:8000/general/v0/general
This should help resolve the issue with indexing PDF files [1][2][3][4].
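As a rough sanity check of the two settings above, the following Python sketch validates them before restarting the containers. The function name `check_etl_config` is hypothetical and not part of Dify; it only checks that the values look plausible, not that the Unstructured service is reachable.

```python
import os
from urllib.parse import urlparse


def check_etl_config(env):
    """Return a list of problems found in the ETL-related settings.

    `env` is any mapping of variable names to values, e.g. os.environ.
    This helper is an illustration only; Dify does not ship it.
    """
    problems = []
    if env.get("ETL_TYPE") != "Unstructured":
        problems.append("ETL_TYPE is not set to 'Unstructured'")
    parsed = urlparse(env.get("UNSTRUCTURED_API_URL", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("UNSTRUCTURED_API_URL is missing or not an http(s) URL")
    return problems


if __name__ == "__main__":
    # Report anything that looks wrong in the current environment.
    for problem in check_etl_config(os.environ):
        print("config problem:", problem)
```

Passing a plain dict instead of `os.environ` makes it easy to test values taken from a `.env` file before applying them.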
Please provide the container logs of api and worker.
Here is the log from docker-api-1 (--follow):
Running migrations
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Preparing database migration...
Start database migration.
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
Database migration successful!
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[2024-08-26 06:01:53 +0000] [1] [INFO] Starting gunicorn 22.0.0
[2024-08-26 06:01:53 +0000] [1] [INFO] Listening at: http://0.0.0.0:5001 (1)
[2024-08-26 06:01:53 +0000] [1] [INFO] Using worker: gevent
[2024-08-26 06:01:53 +0000] [63] [INFO] Booting worker with pid: 63
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:02:49 +0000] [1] [ERROR] Worker (pid:61) exited with code 1
[2024-08-26 06:02:49 +0000] [1] [ERROR] Worker (pid:61) exited with code 1.
[2024-08-26 06:02:49 +0000] [1] [ERROR] Worker (pid:63) was sent code 133!
[2024-08-26 06:02:49 +0000] [65] [INFO] Booting worker with pid: 65
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:03:03 +0000] [1] [ERROR] Worker (pid:65) was sent code 133!
[2024-08-26 06:03:03 +0000] [67] [INFO] Booting worker with pid: 67
[2024-08-26 06:03:29 +0000] [1] [INFO] Handling signal: term
[2024-08-26 06:03:30 +0000] [67] [INFO] Worker exiting (pid: 67)
[2024-08-26 06:03:34 +0000] [1] [INFO] Shutting down: Master
Running migrations
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Preparing database migration...
Start database migration.
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
Database migration successful!
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[2024-08-26 06:27:28 +0000] [1] [INFO] Starting gunicorn 22.0.0
[2024-08-26 06:27:28 +0000] [1] [INFO] Listening at: http://0.0.0.0:5001 (1)
[2024-08-26 06:27:28 +0000] [1] [INFO] Using worker: gevent
[2024-08-26 06:27:28 +0000] [62] [INFO] Booting worker with pid: 62
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:28:28 +0000] [1] [ERROR] Worker (pid:60) exited with code 1
[2024-08-26 06:28:28 +0000] [1] [ERROR] Worker (pid:60) exited with code 1.
[2024-08-26 06:28:28 +0000] [1] [ERROR] Worker (pid:62) was sent code 133!
[2024-08-26 06:28:28 +0000] [64] [INFO] Booting worker with pid: 64
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:28:35 +0000] [1] [ERROR] Worker (pid:64) was sent code 133!
[2024-08-26 06:28:35 +0000] [66] [INFO] Booting worker with pid: 66
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:03 +0000] [1] [ERROR] Worker (pid:66) was sent code 133!
[2024-08-26 06:31:03 +0000] [68] [INFO] Booting worker with pid: 68
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:07 +0000] [1] [ERROR] Worker (pid:68) was sent code 133!
[2024-08-26 06:31:07 +0000] [70] [INFO] Booting worker with pid: 70
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:26 +0000] [1] [ERROR] Worker (pid:70) was sent code 133!
[2024-08-26 06:31:26 +0000] [72] [INFO] Booting worker with pid: 72
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:50 +0000] [1] [ERROR] Worker (pid:72) was sent code 133!
[2024-08-26 06:31:50 +0000] [74] [INFO] Booting worker with pid: 74
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:32:10 +0000] [1] [ERROR] Worker (pid:74) was sent code 133!
[2024-08-26 06:32:10 +0000] [76] [INFO] Booting worker with pid: 76
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:32:14 +0000] [1] [ERROR] Worker (pid:76) was sent code 133!
[2024-08-26 06:32:14 +0000] [78] [INFO] Booting worker with pid: 78
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:36:41 +0000] [1] [ERROR] Worker (pid:78) was sent code 133!
[2024-08-26 06:36:41 +0000] [80] [INFO] Booting worker with pid: 80
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:36:50 +0000] [1] [ERROR] Worker (pid:80) was sent code 133!
[2024-08-26 06:36:50 +0000] [82] [INFO] Booting worker with pid: 82
Log for worker:
[2024-08-26 06:38:01,518: INFO/MainProcess] Connected to redis://:**@redis:6379/1
[2024-08-26 06:38:01,527: INFO/MainProcess] mingle: searching for neighbors
[2024-08-26 06:38:02,546: INFO/MainProcess] mingle: all alone
[2024-08-26 06:38:02,580: INFO/MainProcess] celery@1a168805e19f ready.
[2024-08-26 06:38:02,588: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[83039707-b89e-4a4b-a1ab-7b67e8983ed5] received
[2024-08-26 06:38:02,797: INFO/MainProcess] Start process document: 70e55be0-3e4e-43a8-9bb9-da35be7829fa
[2024-08-26 06:38:03,049: INFO/MainProcess] Task tasks.deal_dataset_vector_index_task.deal_dataset_vector_index_task[22ad1ec2-8a5f-4895-bc5c-d3ce2a231ed4] received
[2024-08-26 06:38:03,051: INFO/MainProcess] pidbox: Connected to redis://:**@redis:6379/1.
Building prefix dict from the default dictionary ...
[2024-08-26 06:38:03,780: DEBUG/MainProcess] Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
[2024-08-26 06:38:05,504: DEBUG/MainProcess] Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.854 seconds.
[2024-08-26 06:38:05,634: DEBUG/MainProcess] Loading model cost 1.854 seconds.
Prefix dict has been built successfully.
[2024-08-26 06:38:05,634: DEBUG/MainProcess] Prefix dict has been built successfully.
[2024-08-26 06:38:13,224: INFO/MainProcess] Processed dataset: d6621a32-4277-4594-b422-111da86ead7a latency: 10.634749519987963
[2024-08-26 06:38:13,262: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[83039707-b89e-4a4b-a1ab-7b67e8983ed5] succeeded in 10.672490199911408s: None
[2024-08-26 06:38:13,264: INFO/MainProcess] Start deal dataset vector index: d6621a32-4277-4594-b422-111da86ead7a
[2024-08-26 06:38:13,314: INFO/MainProcess] Task tasks.clean_document_task.clean_document_task[d3a44cbf-e943-48c9-93a6-7d5a24f83ea1] received
[2024-08-26 06:38:13,576: INFO/MainProcess] Deal dataset vector index: d6621a32-4277-4594-b422-111da86ead7a latency: 0.3124470600159839
[2024-08-26 06:38:13,613: INFO/MainProcess] Task tasks.deal_dataset_vector_index_task.deal_dataset_vector_index_task[22ad1ec2-8a5f-4895-bc5c-d3ce2a231ed4] succeeded in 0.3490739999106154s: None
[2024-08-26 06:38:13,614: INFO/MainProcess] Start clean document when document deleted: 86a4cef6-580c-459a-9fdb-f29c239b9b4e
[2024-08-26 06:38:13,631: INFO/MainProcess] Cleaned document when document deleted: 86a4cef6-580c-459a-9fdb-f29c239b9b4e latency: 0.016332659986801445
[2024-08-26 06:38:13,667: INFO/MainProcess] Task tasks.clean_document_task.clean_document_task[d3a44cbf-e943-48c9-93a6-7d5a24f83ea1] succeeded in 0.05289604002609849s: None
[2024-08-26 06:40:49,631: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[b543efb9-9ed0-41ab-93a4-1d14729ed60f] received
[2024-08-26 06:40:49,636: INFO/MainProcess] Start process document: 2ea2b1b5-2f7f-41e5-82f0-7d3fb1dc31e0
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
Running migrations
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
In the worker log, the successful part is from when I uploaded the docx, which was processed successfully. After I uploaded the PDF, processing failed. I'm still new to all of this and may have made some basic mistake. Please help me fix this issue, thanks.
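To make that reading of the worker log easier to repeat, here is a minimal Python sketch that lists tasks which were received but never reported as succeeded, together with any `[FATAL:...]` lines. The regular expressions are assumptions based only on the line shapes pasted in this thread, and `summarize_worker_log` is a hypothetical helper name.

```python
import re

# Patterns inferred from the log lines in this thread (assumptions, not a spec).
FATAL_RE = re.compile(r"^\[FATAL:(?P<source>[^\]]+)\]")
TASK_RE = re.compile(
    r"Task (?P<name>[\w.]+)\[(?P<id>[0-9a-f-]+)\] (?P<state>received|succeeded)"
)


def summarize_worker_log(lines):
    """Return (unfinished_tasks, fatal_sources) for an iterable of log lines."""
    received = {}
    succeeded = set()
    fatals = []
    for line in lines:
        task = TASK_RE.search(line)
        if task:
            if task["state"] == "received":
                received[task["id"]] = task["name"]
            else:
                succeeded.add(task["id"])
            continue
        fatal = FATAL_RE.match(line)
        if fatal:
            fatals.append(fatal["source"])
    unfinished = {tid: name for tid, name in received.items() if tid not in succeeded}
    return unfinished, fatals


if __name__ == "__main__":
    import sys

    # Usage: docker logs <worker container> 2>&1 | python this_script.py
    unfinished, fatals = summarize_worker_log(sys.stdin)
    print("unfinished tasks:", unfinished)
    print("fatal sources:", fatals)
```

Run against the worker log above, the docx task appears as both received and succeeded, while the PDF task is only received and is followed by the fatal check.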
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
I have seen a similar error, but we have not been able to reproduce it.
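The failed check itself says that the system page size must be either 1 << 12 (4096 bytes) or 1 << 14 (16384 bytes). A quick way to see what the container reports is the Python sketch below, run inside the api or worker container. Whether an unsupported page size is really the cause here is an assumption; the thread does not confirm it. Some ARM64 Linux kernels, for example, are built with 64 KiB pages, and a container shares the page size of the kernel it runs on.

```python
import mmap

# Page sizes accepted by the check quoted in the log: 1 << 12 and 1 << 14.
SUPPORTED_PAGE_SIZES = (1 << 12, 1 << 14)


def page_size_supported(page_size):
    """Return True if page_size would satisfy the check from the log."""
    return page_size in SUPPORTED_PAGE_SIZES


if __name__ == "__main__":
    page_size = mmap.PAGESIZE  # page size reported by the running kernel
    print(f"system page size: {page_size} bytes")
    if page_size_supported(page_size):
        print("page size satisfies the check; the cause is likely elsewhere")
    else:
        print("page size would fail the check quoted in the log")
```

If the reported value is neither 4096 nor 16384, including it in the issue would give the maintainers something concrete to reproduce.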
Dify version
0.7.0
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
PDF files can't be indexed, but docx files can. I run Dify locally, with bge-m3 for embedding and a bge reranker, and I also use the online qwen embedding. Both setups work fine when uploading docx files. However, when I upload a PDF, indexing seems to take forever. I waited several hours for Dify to index a 10-page PDF with no OCR, and it was still indexing. In case it was a knowledge base setup issue, I created a knowledge base with online embedding and no rerank, but that did not help.
✔️ Expected Behavior
The PDF is indexed.
❌ Actual Behavior
The PDF is not indexed.