langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

PDF can't be indexed but docx can. #7631

Closed macfangyw closed 3 weeks ago

macfangyw commented 2 months ago

Self Checks

Dify version

0.7.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

PDF files can't be indexed, but docx files can. I run Dify locally with the bge reranker and bge m3 for reranking and embedding, and I have also tried the online qwen embedding. Both setups work fine when I upload docx files. However, when I upload a PDF, indexing seems to take forever: I waited several hours for Dify to index a 10-page PDF with no OCR, and it was still indexing. In case it was a knowledge base setup issue, I created a new knowledge base with the online embedding and no rerank, but that did not help.

✔️ Expected Behavior

The PDF is indexed.

❌ Actual Behavior

The PDF is not indexed.

macfangyw commented 2 months ago

BTW, the two docx files that failed to index are my own problem; docx indexing itself works fine.

dosubot[bot] commented 2 months ago

Hey there, @macfangyw! I'm here to help you with any bugs, questions, or contributions you have. Let me know how I can assist you!

I found a similar issue where PDF files cannot be indexed while DOCX files can in Dify version 0.7.0. The problem arises because the community version defaults to using Dify ETL, which does support PDF files, but there might be differences in file extraction effects between Dify ETL and Unstructured ETL. The SaaS version defaults to using Unstructured ETL, which supports a broader range of file formats, including PDF.

To enable PDF indexing, you may need to switch to Unstructured ETL by configuring the appropriate environment variables:

Example configuration:

ETL_TYPE=Unstructured
UNSTRUCTURED_API_URL=http://unstructured:8000/general/v0/general

This should help resolve the issue with indexing PDF files [1][2][3][4].
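
As a quick sanity check before restarting the containers, the two variables above can be validated with a short script. This is only a minimal sketch: the helper names `parse_env` and `etl_problems` are hypothetical, and the rule that `UNSTRUCTURED_API_URL` must be non-empty whenever `ETL_TYPE=Unstructured` is an assumption drawn from the example configuration, not a documented Dify check.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def etl_problems(env: dict[str, str]) -> list[str]:
    """Return human-readable problems with the ETL settings (assumed rule)."""
    problems: list[str] = []
    # Assumption: Unstructured ETL is unusable without an API URL.
    if env.get("ETL_TYPE") == "Unstructured" and not env.get("UNSTRUCTURED_API_URL"):
        problems.append("ETL_TYPE=Unstructured but UNSTRUCTURED_API_URL is not set")
    return problems


sample = """\
ETL_TYPE=Unstructured
UNSTRUCTURED_API_URL=http://unstructured:8000/general/v0/general
"""
print(etl_problems(parse_env(sample)))
```

An empty list means the two settings are at least consistent with each other; it says nothing about whether the Unstructured service is actually reachable at that URL.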

crazywoola commented 2 months ago

Please provide the container logs of api and worker.

macfangyw commented 2 months ago

The log from docker-api-1 (--follow):

Running migrations
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Preparing database migration...
Start database migration.
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
Database migration successful!
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[2024-08-26 06:01:53 +0000] [1] [INFO] Starting gunicorn 22.0.0
[2024-08-26 06:01:53 +0000] [1] [INFO] Listening at: http://0.0.0.0:5001 (1)
[2024-08-26 06:01:53 +0000] [1] [INFO] Using worker: gevent
[2024-08-26 06:01:53 +0000] [63] [INFO] Booting worker with pid: 63
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:02:49 +0000] [1] [ERROR] Worker (pid:61) exited with code 1
[2024-08-26 06:02:49 +0000] [1] [ERROR] Worker (pid:61) exited with code 1.
[2024-08-26 06:02:49 +0000] [1] [ERROR] Worker (pid:63) was sent code 133!
[2024-08-26 06:02:49 +0000] [65] [INFO] Booting worker with pid: 65
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:03:03 +0000] [1] [ERROR] Worker (pid:65) was sent code 133!
[2024-08-26 06:03:03 +0000] [67] [INFO] Booting worker with pid: 67
[2024-08-26 06:03:29 +0000] [1] [INFO] Handling signal: term
[2024-08-26 06:03:30 +0000] [67] [INFO] Worker exiting (pid: 67)
[2024-08-26 06:03:34 +0000] [1] [INFO] Shutting down: Master
Running migrations
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Preparing database migration...
Start database migration.
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
Database migration successful!
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[2024-08-26 06:27:28 +0000] [1] [INFO] Starting gunicorn 22.0.0
[2024-08-26 06:27:28 +0000] [1] [INFO] Listening at: http://0.0.0.0:5001 (1)
[2024-08-26 06:27:28 +0000] [1] [INFO] Using worker: gevent
[2024-08-26 06:27:28 +0000] [62] [INFO] Booting worker with pid: 62
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:28:28 +0000] [1] [ERROR] Worker (pid:60) exited with code 1
[2024-08-26 06:28:28 +0000] [1] [ERROR] Worker (pid:60) exited with code 1.
[2024-08-26 06:28:28 +0000] [1] [ERROR] Worker (pid:62) was sent code 133!
[2024-08-26 06:28:28 +0000] [64] [INFO] Booting worker with pid: 64
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:28:35 +0000] [1] [ERROR] Worker (pid:64) was sent code 133!
[2024-08-26 06:28:35 +0000] [66] [INFO] Booting worker with pid: 66
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:03 +0000] [1] [ERROR] Worker (pid:66) was sent code 133!
[2024-08-26 06:31:03 +0000] [68] [INFO] Booting worker with pid: 68
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:07 +0000] [1] [ERROR] Worker (pid:68) was sent code 133!
[2024-08-26 06:31:07 +0000] [70] [INFO] Booting worker with pid: 70
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:26 +0000] [1] [ERROR] Worker (pid:70) was sent code 133!
[2024-08-26 06:31:26 +0000] [72] [INFO] Booting worker with pid: 72
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:31:50 +0000] [1] [ERROR] Worker (pid:72) was sent code 133!
[2024-08-26 06:31:50 +0000] [74] [INFO] Booting worker with pid: 74
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:32:10 +0000] [1] [ERROR] Worker (pid:74) was sent code 133!
[2024-08-26 06:32:10 +0000] [76] [INFO] Booting worker with pid: 76
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:32:14 +0000] [1] [ERROR] Worker (pid:76) was sent code 133!
[2024-08-26 06:32:14 +0000] [78] [INFO] Booting worker with pid: 78
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:36:41 +0000] [1] [ERROR] Worker (pid:78) was sent code 133!
[2024-08-26 06:36:41 +0000] [80] [INFO] Booting worker with pid: 80
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
[2024-08-26 06:36:50 +0000] [1] [ERROR] Worker (pid:80) was sent code 133!
[2024-08-26 06:36:50 +0000] [82] [INFO] Booting worker with pid: 82

Log for the worker:

[2024-08-26 06:38:01,518: INFO/MainProcess] Connected to redis://:**@redis:6379/1
[2024-08-26 06:38:01,527: INFO/MainProcess] mingle: searching for neighbors
[2024-08-26 06:38:02,546: INFO/MainProcess] mingle: all alone
[2024-08-26 06:38:02,580: INFO/MainProcess] celery@1a168805e19f ready.
[2024-08-26 06:38:02,588: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[83039707-b89e-4a4b-a1ab-7b67e8983ed5] received
[2024-08-26 06:38:02,797: INFO/MainProcess] Start process document: 70e55be0-3e4e-43a8-9bb9-da35be7829fa
[2024-08-26 06:38:03,049: INFO/MainProcess] Task tasks.deal_dataset_vector_index_task.deal_dataset_vector_index_task[22ad1ec2-8a5f-4895-bc5c-d3ce2a231ed4] received
[2024-08-26 06:38:03,051: INFO/MainProcess] pidbox: Connected to redis://:**@redis:6379/1.
Building prefix dict from the default dictionary ...
[2024-08-26 06:38:03,780: DEBUG/MainProcess] Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
[2024-08-26 06:38:05,504: DEBUG/MainProcess] Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.854 seconds.
[2024-08-26 06:38:05,634: DEBUG/MainProcess] Loading model cost 1.854 seconds.
Prefix dict has been built successfully.
[2024-08-26 06:38:05,634: DEBUG/MainProcess] Prefix dict has been built successfully.
[2024-08-26 06:38:13,224: INFO/MainProcess] Processed dataset: d6621a32-4277-4594-b422-111da86ead7a latency: 10.634749519987963
[2024-08-26 06:38:13,262: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[83039707-b89e-4a4b-a1ab-7b67e8983ed5] succeeded in 10.672490199911408s: None
[2024-08-26 06:38:13,264: INFO/MainProcess] Start deal dataset vector index: d6621a32-4277-4594-b422-111da86ead7a
[2024-08-26 06:38:13,314: INFO/MainProcess] Task tasks.clean_document_task.clean_document_task[d3a44cbf-e943-48c9-93a6-7d5a24f83ea1] received
[2024-08-26 06:38:13,576: INFO/MainProcess] Deal dataset vector index: d6621a32-4277-4594-b422-111da86ead7a latency: 0.3124470600159839
[2024-08-26 06:38:13,613: INFO/MainProcess] Task tasks.deal_dataset_vector_index_task.deal_dataset_vector_index_task[22ad1ec2-8a5f-4895-bc5c-d3ce2a231ed4] succeeded in 0.3490739999106154s: None
[2024-08-26 06:38:13,614: INFO/MainProcess] Start clean document when document deleted: 86a4cef6-580c-459a-9fdb-f29c239b9b4e
[2024-08-26 06:38:13,631: INFO/MainProcess] Cleaned document when document deleted: 86a4cef6-580c-459a-9fdb-f29c239b9b4e latency: 0.016332659986801445
[2024-08-26 06:38:13,667: INFO/MainProcess] Task tasks.clean_document_task.clean_document_task[d3a44cbf-e943-48c9-93a6-7d5a24f83ea1] succeeded in 0.05289604002609849s: None
[2024-08-26 06:40:49,631: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[b543efb9-9ed0-41ab-93a4-1d14729ed60f] received
[2024-08-26 06:40:49,636: INFO/MainProcess] Start process document: 2ea2b1b5-2f7f-41e5-82f0-7d3fb1dc31e0
[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).
Running migrations
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

macfangyw commented 2 months ago

In the worker log, the successful part is where I uploaded the docx and it was processed without errors, but after I uploaded the PDF, processing failed. I'm still new to all of this and may be making a basic mistake. Please help me fix this issue, thanks.
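
To make the difference between the two uploads easier to see, the worker log can be scanned for tasks that were received but never reported as succeeded. The following is a minimal sketch, assuming the log lines keep the `Task <name>[<id>] received` / `succeeded` shape shown above; the function name `unfinished_tasks` is hypothetical.

```python
import re

# Matches the Celery-style task lines as they appear in the worker log above.
TASK_RE = re.compile(
    r"Task (?P<name>[\w.]+)\[(?P<id>[0-9a-f-]+)\] (?P<state>received|succeeded)"
)


def unfinished_tasks(log_text: str) -> list[str]:
    """Return IDs of tasks that were received but never logged as succeeded."""
    received: list[str] = []
    succeeded: set[str] = set()
    for line in log_text.splitlines():
        match = TASK_RE.search(line)
        if match is None:
            continue
        if match.group("state") == "received":
            received.append(match.group("id"))
        else:
            succeeded.add(match.group("id"))
    return [task_id for task_id in received if task_id not in succeeded]


# Three lines copied from the worker log in this thread.
sample = """\
[2024-08-26 06:38:02,588: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[83039707-b89e-4a4b-a1ab-7b67e8983ed5] received
[2024-08-26 06:38:13,262: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[83039707-b89e-4a4b-a1ab-7b67e8983ed5] succeeded in 10.672490199911408s: None
[2024-08-26 06:40:49,631: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[b543efb9-9ed0-41ab-93a4-1d14729ed60f] received
"""
print(unfinished_tasks(sample))
```

On these lines the only unfinished task is the second `document_indexing_task`, the one that is followed by the `FATAL` line in the log.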

crazywoola commented 2 months ago

[FATAL:partition_root.cc(863)] Check failed: (internal::SystemPageSize() == (size_t{1} << 12)) || (internal::SystemPageSize() == (size_t{1} << 14)).

I have seen a similar error, but we have not been able to reproduce it.
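
Read literally, the failed check only passes when the system page size is `1 << 12` (4 KiB) or `1 << 14` (16 KiB). A minimal sketch for comparing the page size inside the container against those two values, assuming Python is available there; the helper name `page_size_supported` is hypothetical, and whether the page size is the actual root cause here is not confirmed in this thread.

```python
import mmap

# The two values accepted by the check quoted in the FATAL log line.
SUPPORTED_PAGE_SIZES = (1 << 12, 1 << 14)  # 4 KiB and 16 KiB


def page_size_supported(size: int) -> bool:
    """Return True if the given page size would satisfy the quoted check."""
    return size in SUPPORTED_PAGE_SIZES


size = mmap.PAGESIZE  # page size reported by the OS to this process
print(f"page size: {size} bytes, passes quoted check: {page_size_supported(size)}")
```

Running this in the api or worker container shows whether the reported page size is one of the two accepted values.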