infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Bug]: parse pdf file error #444

Closed: ben-qiao closed this issue 4 months ago

ben-qiao commented 4 months ago

Is there an existing issue for the same bug?

  • [x] I have checked the existing issues.

Branch name

main

Commit ID

fjoiesjf0923iur092jdpo2

Other environment information

Linux
RAGFlow installed via Docker
The deepdoc model files were copied manually because of the error "No such file or directory: '/ragflow/rag/res/deepdoc/ocr.res'be0c1e50eef6047b412d1800aa89aba4d275f997/ocr.res"

Actual behavior

RAGFlow starts normally, but importing a PDF file reports the following error (the same message is repeated four times): Chunkking Java开发手册(黄山版).pdf/Java开发手册(黄山版).pdf: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

Expected behavior

The PDF file is parsed normally.

Steps to reproduce

Create a knowledge base, then import a PDF file.

Additional information

No response

Jiafan commented 4 months ago

I tested with the PDF file on my server, and it works fine.

Screenshot 2024-04-19 at 10 10 49

But I checked my Docker path: the ocr.res file is located in /ragflow/rag/res, i.e. /ragflow/rag/res/ocr.res. That is different from the path you mentioned.

@ben-qiao

Screenshot 2024-04-19 at 10 13 27

ben-qiao commented 4 months ago

I checked my path; the ocr.res file is at the same path: image

KevinHuSh commented 4 months ago

  1. Manually download the resource files from huggingface.co/InfiniFlow/deepdoc to a local folder, for example ~/deepdoc.
  2. Add a volume to docker-compose.yml, for example:
    • ~/deepdoc:/ragflow/rag/res/deepdoc
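
The two steps above can also be scripted. This is a minimal sketch, not the maintainers' procedure: it assumes the huggingface_hub package is installed on the host and uses its snapshot_download function to fetch the InfiniFlow/deepdoc repository. The helper names download_deepdoc and volume_entry are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Container path taken from the thread; the host folder is the one suggested above.
CONTAINER_DIR = "/ragflow/rag/res/deepdoc"


def download_deepdoc(host_dir: str = "~/deepdoc") -> Path:
    """Download the InfiniFlow/deepdoc repository into host_dir (needs network access)."""
    # Imported lazily so volume_entry() works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(host_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id="InfiniFlow/deepdoc", local_dir=str(target))
    return target


def volume_entry(host_dir: str = "~/deepdoc", container_dir: str = CONTAINER_DIR) -> str:
    """Build the 'host:container' string for the volumes list in docker-compose.yml."""
    return f"{Path(host_dir).expanduser()}:{container_dir}"
```

Calling download_deepdoc() once and then printing volume_entry() gives the line to paste under the volumes key of the ragflow service.
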
KevinHuSh commented 4 months ago

If it happened on the demo website, please delete the file and upload it again. If it's a local deployment, check the status of MinIO.
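
For a local deployment, one way to check MinIO is to query its liveness endpoint, /minio/health/live. A minimal sketch, assuming MinIO is reachable on localhost at its default API port 9000 (the host and port mapped in your docker-compose.yml may differ); minio_health_url and minio_is_live are hypothetical helper names:

```python
import http.client
import urllib.request


def minio_health_url(host: str = "localhost", port: int = 9000) -> str:
    """Return the URL of MinIO's liveness probe for the given host and port."""
    return f"http://{host}:{port}/minio/health/live"


def minio_is_live(host: str = "localhost", port: int = 9000, timeout: float = 3.0) -> bool:
    """Return True if the liveness probe answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(minio_health_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP status.
        return False
```

A False result only says the probe could not be reached from where the script runs; it does not by itself explain the parsing error.
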

ben-qiao commented 4 months ago

I downloaded deepdoc from Hugging Face and added a volume to docker-compose.yml. After RAGFlow started, I imported a PDF file into the knowledge base and got a new error:

'''
[WARNING] [2024-04-19 16:22:04,334] [synonym.__init__] [line:24]: Realtime synonym is disabled, since no redis connection.
[WARNING] Load term.freq FAIL!
[WARNING] Load term.freq FAIL!
Traceback (most recent call last):
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 42, in __init__
    self.updown_cnt_mdl.load_model(os.path.join(
  File "/root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/core.py", line 2588, in load_model
    _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
  File "/root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/core.py", line 282, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [16:22:07] /workspace/dmlc-core/src/io/local_filesys.cc:209: Check failed: allow_null: LocalFileSystem::Open "/ragflow/rag/res/deepdoc/updown_concat_xgb.model": No such file or directory
Stack trace:
  [bt] (0) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f002eec424e]
  [bt] (1) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0xcc9637) [0x7f002f9d3637]
  [bt] (2) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0xcb54ce) [0x7f002f9bf4ce]
  [bt] (3) /root/miniconda3/envs/py11/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0x18e) [0x7f002ee78ace]
  [bt] (4) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x7f0157371052]
  [bt] (5) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x7f015736f925]
  [bt] (6) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x7f015737006e]
  [bt] (7) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92ba) [0x7f01573812ba]
  [bt] (8) /root/miniconda3/envs/py11/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x87e3) [0x7f01573807e3]
'''
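
The traceback shows that XGBoost could not open /ragflow/rag/res/deepdoc/updown_concat_xgb.model, so the mounted folder was missing at least that file. A pre-flight check along the following lines could be run inside the container before importing documents. It is a sketch only: REQUIRED lists just the two file names mentioned in this thread, not the full contents of the deepdoc repository, and missing_files is a hypothetical helper.

```python
from pathlib import Path

# Assumption: only the two file names that appear in this thread are checked;
# the real deepdoc folder contains more files than these.
REQUIRED = ("ocr.res", "updown_concat_xgb.model")


def missing_files(model_dir: str, required=REQUIRED) -> list:
    """Return the names in `required` that are not regular files inside model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, missing_files("/ragflow/rag/res/deepdoc") returns an empty list when both files are present, and otherwise names the ones the volume mapping failed to provide.
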

ben-qiao commented 4 months ago

It was a configuration problem. I pulled the new version 0.3.0 and ran Docker with docker-compose-CN.yml (previously I had run docker-compose.yml). After that, I imported the PDF and the file was parsed successfully. The new configuration added in docker-compose-CN.yml: