netease-youdao / QAnything

Question and Answer based on Anything.
https://qanything.ai
GNU Affero General Public License v3.0
11.53k stars 1.12k forks

[BUG] Error in Powerful PDF parsing (powerful parsing reports an error) #405

Open allentern opened 3 months ago

allentern commented 3 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

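Running in Python mode, entirely on CPU, calling an external LLM. Powerful PDF parsing is enabled in the config:

    # PDF parsing parameters
    pdf_config = {
        # Whether to use the fast PDF parser. When set to False, the optimized PDF parser is used, at the cost of speed.
        "USE_FAST_PDF_PARSER": False
    }

After starting the service and uploading a PDF, the backend log shows:

Error in Powerful PDF parsing: PdfLoader.__init__() got an unexpected keyword argument 'root_dir', use fast PDF parser instead.
...
insert_to_faiss: success num: 1, failed num: 0

The log shows that powerful parsing failed and the fast parser was used instead.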
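The warning above follows a common pattern: a caller passes a keyword argument that the constructor does not declare, Python raises a `TypeError`, and a broad `except` turns it into a logged fallback. The sketch below is only an illustration of that mechanism, not QAnything's actual code. The `PdfLoader` class here is a hypothetical stand-in, and `fast_parse`, the return values, and the `/tmp` path are all assumptions made up for the example.

```python
import logging

logger = logging.getLogger("sketch")


class PdfLoader:
    """Hypothetical stand-in whose constructor does not declare root_dir."""

    def __init__(self, filename):
        self.filename = filename


def fast_parse(filename):
    """Hypothetical fast parser used as the fallback path."""
    return f"fast:{filename}"


def split_file_to_docs(filename, use_fast_parser=False):
    """Try the powerful parser first; fall back to the fast one on any error."""
    if not use_fast_parser:
        try:
            # The caller passes root_dir, but this PdfLoader does not accept it,
            # so Python raises TypeError before any parsing happens.
            loader = PdfLoader(filename=filename, root_dir="/tmp")
            return f"powerful:{loader.filename}"
        except Exception as exc:
            logger.warning(
                "Error in Powerful PDF parsing: %s, use fast PDF parser instead.",
                exc,
            )
    return fast_parse(filename)
```

With this sketch, `split_file_to_docs("doc.pdf")` logs the warning and returns the fast parser's result, which is consistent with an upload that still reports success even though powerful parsing never ran.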

Expected Behavior

Powerful parsing is expected to run correctly.

Environment

- OS: Ubuntu 22.04
- NVIDIA Driver: None
- CUDA: None
- docker: None
- docker-compose: None
- NVIDIA GPU: None
- NVIDIA GPU Memory: None

QAnything logs

Contents of debug.log:

2024-06-18 10:40:56,518 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - upload_files zzp
2024-06-18 10:40:56,520 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - mode: strong
2024-06-18 10:40:56,524 - [PID: 88643][MainProcess] - [Function: check_kb_exist] - INFO - check_kb_exist [('KB2baad59dd8b346f79ae06061c86da883',)]
2024-06-18 10:40:56,525 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - ori name: 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,525 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - decode name: 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,525 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - cleaned name: 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,526 - [PID: 88643][MainProcess] - [Function: check_userexist] - INFO - check_user_exist [('zzp',)]
2024-06-18 10:40:56,527 - [PID: 88643][MainProcess] - [Function: check_kb_exist] - INFO - check_kb_exist [('KB2baad59dd8b346f79ae06061c86da883',)]
2024-06-18 10:40:56,530 - [PID: 88643][MainProcess] - [Function: add_file] - INFO - add_file: e87590666140418eba9d0f135d5ea390
2024-06-18 10:40:56,530 - [PID: 88643][MainProcess] - [Function: upload_files] - INFO - 建筑光伏系统应用技术标准.pdf, e87590666140418eba9d0f135d5ea390, success
2024-06-18 10:40:56,541 - [PID: 88643][MainProcess] - [Function: init] - INFO - success init localfile 建筑光伏系统应用技术标准.pdf
2024-06-18 10:40:56,545 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - insert_files_to_faiss: KB2baad59dd8b346f79ae06061c86da883
2024-06-18 10:40:56,546 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - WARNING - Error in Powerful PDF parsing: PdfLoader.__init__() got an unexpected keyword argument 'root_dir', use fast PDF parser instead.
2024-06-18 10:40:57,513 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - INFO - before 2nd split doc lens: 8
2024-06-18 10:40:57,514 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - INFO - after 2nd split doc lens: 8
2024-06-18 10:40:57,515 - [PID: 88643][MainProcess] - [Function: split_file_to_docs] - INFO - langchain analysis content head: 住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览专用
住房城乡建设部信息公开
浏览
2024-06-18 10:40:57,515 - [PID: 88643][MainProcess] - [Function: inner] - INFO - 函数 split_file_to_docs 执行耗时: 0.9691917896270752 秒
2024-06-18 10:40:57,518 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - split time: 0.9694967269897461 8
2024-06-18 10:40:57,521 - [PID: 88643][MainProcess] - [Function: load_vector_store] - INFO - load faiss index: /root/QAnything/QANY_DB/faiss/KB2baad59dd8b346f79ae06061c86da883/faiss_index
2024-06-18 10:40:58,044 - [PID: 88643][MainProcess] - [Function: _load_kb_to_memory] - INFO - FAISS load kb_ids: ['KB2baad59dd8b346f79ae06061c86da883']
2024-06-18 10:40:58,046 - [PID: 88643][MainProcess] - [Function: get_len_safe_embeddings] - INFO - embedding number: 1
2024-06-18 10:40:59,334 - [PID: 88643][MainProcess] - [Function: get_embedding] - INFO - onnx infer time: 1.2814881801605225
2024-06-18 10:40:59,337 - [PID: 88643][MainProcess] - [Function: get_embedding] - INFO - embedding shape: (8, 768)
2024-06-18 10:40:59,342 - [PID: 88643][MainProcess] - [Function: inner] - INFO - 函数 get_len_safe_embeddings 执行耗时: 1.2964568138122559 秒
2024-06-18 10:40:59,357 - [PID: 88643][MainProcess] - [Function: add_document] - INFO - add documents number: 8
2024-06-18 10:40:59,363 - [PID: 88643][MainProcess] - [Function: add_document] - INFO - save faiss index: /root/QAnything/QANY_DB/faiss/KB2baad59dd8b346f79ae06061c86da883/faiss_index
2024-06-18 10:40:59,363 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - insert time: 1.847867727279663
2024-06-18 10:40:59,365 - [PID: 88643][MainProcess] - [Function: insert_files_to_faiss] - INFO - insert_to_faiss: success num: 1, failed num: 0
2024-06-18 10:41:22,223 - [PID: 88643][MainProcess] - [Function: list_docs] - INFO - list_docs zzp
2024-06-18 10:41:22,224 - [PID: 88643][MainProcess] - [Function: list_docs] - INFO - kb_id: KB2baad59dd8b346f79ae06061c86da883

Steps To Reproduce

1. Run in Python mode, entirely on CPU, calling an external LLM.
2. Enable powerful parsing in the config.
3. Start the service.
4. Upload a PDF and watch the log.
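When reproducing, it may help to confirm whether the `PdfLoader` class that actually gets imported accepts a `root_dir` keyword, since one possible cause (an assumption, not confirmed in this thread) is a mismatch between the calling code and the installed loader. A minimal sketch using the standard `inspect` module follows. Both loader classes in it are hypothetical stand-ins; in a real check you would pass the class imported from your own installation.

```python
import inspect


def accepts_keyword(cls, name):
    """Return True if cls's constructor can be called with `name` as a keyword."""
    params = inspect.signature(cls.__init__).parameters
    if name in params and params[name].kind is not inspect.Parameter.POSITIONAL_ONLY:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class InstalledPdfLoader:
    """Hypothetical loader whose constructor lacks root_dir."""

    def __init__(self, filename):
        self.filename = filename


class ExpectedPdfLoader:
    """Hypothetical loader matching what the caller passes."""

    def __init__(self, filename, root_dir):
        self.filename = filename
        self.root_dir = root_dir


print(accepts_keyword(InstalledPdfLoader, "root_dir"))
print(accepts_keyword(ExpectedPdfLoader, "root_dir"))
```

If the check returns `False` for the loader your installation imports, the call site and the loader definition are out of sync, which would produce the `TypeError` seen in the log.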

Anything else?

No response

Sonder-JX commented 3 months ago

The same problem. Any solution?

fi5ee commented 2 months ago

I ran into the same problem.