netease-youdao / QAnything

Question and Answer based on Anything.
https://qanything.ai
GNU Affero General Public License v3.0

[BUG] Error when trying to use PdfLoader on its own #367

Open tcy6 opened 6 months ago

tcy6 commented 6 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

I added the following lines to self_pdf_loader.py to test how well it parses PDFs:

    file_path = 'cat.pdf'
    file_path = os.path.abspath(os.path.join(os.path.dirname(__file__), file_path))
    loader = PdfLoader(filename=file_path, from_page=14, to_page=15, root_dir=os.path.dirname(file_path))
    markdown_dir = loader.load_to_markdown()
    docs = convert_markdown_to_langchaindoc(markdown_dir)
    docs = PdfLoader.pdf_process(docs)
    print(docs)

However, I ran into an error saying the checkpoints could not be found:

    Traceback (most recent call last):
      File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 203, in <module>
        loader = PdfLoader(filename=file_path, root_dir=os.path.dirname(file_path))
      File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 14, in __init__
        super().__init__()
      File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\parser\pdf_parser.py", line 34, in __init__
        self.layouter = LayoutRecognizer("layout")
      File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\layout_recognizer.py", line 20, in __init__
        super().__init__(self.labels, domain, model_dir)
      File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\recognizer.py", line 21, in __init__
        raise ValueError("not find model file path {}".format(
    ValueError: not find model file path c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel/utils/loader/pdf_to_markdown\checkpoints/layout\layout.onnx
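The ValueError above says the layout model is expected at checkpoints/layout/layout.onnx under the pdf_to_markdown directory. Below is a minimal sketch of a pre-flight check that could be run before constructing PdfLoader. The helper name missing_checkpoints and the REQUIRED_CHECKPOINTS tuple are hypothetical; only the single path shown in the error message is known, and other checkpoint files may be needed as well.

```python
from pathlib import Path

# Relative path taken from the error message above; other checkpoint
# files may also be required, so extend this tuple as needed (assumption).
REQUIRED_CHECKPOINTS = ("checkpoints/layout/layout.onnx",)


def missing_checkpoints(parser_dir, required=REQUIRED_CHECKPOINTS):
    """Return the required checkpoint files that are absent under parser_dir."""
    base = Path(parser_dir)
    return [str(base / rel) for rel in required if not (base / rel).is_file()]


if __name__ == "__main__":
    # Assumed location, matching the traceback; adjust to your checkout.
    parser_dir = Path("qanything_kernel/utils/loader/pdf_to_markdown")
    for path in missing_checkpoints(parser_dir):
        print("missing:", path)
```

An empty list means every listed file is present; anything printed as missing has to be placed under the checkpoints directory first.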

Expected Behavior

No response

Environment

- OS:
- NVIDIA Driver:
- CUDA:
- docker:
- docker-compose:
- NVIDIA GPU:
- NVIDIA GPU Memory:

QAnything logs

No response

Steps To Reproduce

No response

Anything else?

No response

milely commented 5 months ago

Please download the PDF parser checkpoints from ModelScope: https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files

tcy6 commented 5 months ago

OK, thank you very much. One more question: is QAnything unable to handle PDFs that contain no text elements? I tried parsing a screenshot and got an error. If that is the case, what is the OCR in there for? Parsing tables?

tcy6 commented 5 months ago

The error output is as follows:

    <Logger debug_logger (INFO)>
    <Logger qa_logger (INFO)>
    LOCAL DATA PATH: c:\Users\Administrator\Desktop\QAnything-1.4.1\QANY_DB\content
    LOCAL_RERANK_REPO: netease-youdao/bce-reranker-base_v1
    LOCAL_EMBED_REPO: netease-youdao/bce-embedding-base_v1
    table model initing... cpu table model inited...
    WARNING:root:Miss outlines
    INFO:debug_logger:Start OCR!
    1it [00:00, ?it/s]
    INFO:debug_logger:OCR finished in 0.15695199999026954 seconds
    preprocess
    1it [00:00, ?it/s]
    Traceback (most recent call last):
      File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 204, in <module>
        markdown_dir = loader.load_to_markdown()
      File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 53, in load_to_markdown
        page_width = max([b["x1"] for b in self.boxes if b['layout_type'] == 'text']) - min(
    ValueError: max() arg is an empty sequence
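The ValueError comes from calling max() on an empty list: when no box has layout_type 'text', the list comprehension yields nothing. Here is a minimal sketch of a guard, not the project's actual fix. The helper name text_page_width is hypothetical, and because the min(...) term is cut off in the traceback, the "x0" key used for the left edge is an assumption.

```python
def text_page_width(boxes, fallback=None):
    """Width spanned by 'text' boxes, or fallback when there are none."""
    text_boxes = [b for b in boxes if b.get("layout_type") == "text"]
    if not text_boxes:
        # No text boxes (for example an image-only page): avoid max() on
        # an empty sequence and let the caller decide what to do.
        return fallback
    # "x0" as the left edge is an assumption; the traceback truncates min(...).
    return max(b["x1"] for b in text_boxes) - min(b["x0"] for b in text_boxes)
```

A caller could treat a None result as "this page has no extractable text" and report that clearly instead of crashing.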

milely commented 5 months ago

The OCR module was removed because it was too slow, so the parser can currently only handle PDFs whose text can be extracted directly. Support for scanned, image-based PDFs will be added in the future behind a toggle switch.
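Since only PDFs with extractable text are supported for now, it can help to check for a text layer before handing a file to the loader. The following is a rough sketch under stated assumptions: both function names are hypothetical, and it relies on the third-party pypdf package, which is not mentioned in this thread and is imported lazily so the decision logic works without it.

```python
def has_text_layer(page_texts, min_chars=1):
    """True if any page yields at least min_chars non-whitespace characters."""
    return any(len("".join((t or "").split())) >= min_chars for t in page_texts)


def extract_page_texts(pdf_path):
    """Extract text per page with pypdf (an assumed, optional dependency)."""
    from pypdf import PdfReader  # imported lazily; install pypdf to use this

    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]
```

For example, has_text_layer(extract_page_texts("cat.pdf")) returning False suggests the file is image-only and needs OCR before parsing.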

tcy6 commented 5 months ago

Got it, thanks a lot. Since the PDF loader does not actually OCR the PDF, it might be worth removing the OCR-related code from it for now; otherwise it is quite confusing, haha. The log clearly prints "OCR finished", yet no OCR has actually been performed.

xiehurricane commented 5 months ago

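Same here. I uploaded a single-layer PDF containing only images and it failed: no boxes were found and it errored out immediately. Stepping through the code, I found that no OCR is performed.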

zhudongwork commented 5 months ago

I get the same error as in tcy6's traceback above: Error in Powerful PDF parsing: max() arg is an empty sequence. The thing is, what I uploaded was a one-page paper PDF, not an image.

SoonyangZhang commented 4 months ago

For image-only PDFs like the one xiehurricane describes, you can process the PDF with ocrmypdf.
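As a sketch of that suggestion, the snippet below builds and runs an ocrmypdf command that adds a text layer to a PDF before it is passed to the loader. The function names are hypothetical, the choice of --skip-text and the default language are assumptions rather than anything recommended in this thread, and the ocrmypdf CLI (with Tesseract and the relevant language data) must be installed separately.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, languages=("eng",)):
    """Build an ocrmypdf command line that adds a text layer to src."""
    # --skip-text leaves pages that already contain text untouched.
    return ["ocrmypdf", "--skip-text", "-l", "+".join(languages), str(src), str(dst)]


def add_text_layer(src, dst, languages=("eng",)):
    """Run ocrmypdf; requires the ocrmypdf CLI to be installed and on PATH."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, languages), check=True)
```

For a Chinese document one might call add_text_layer("scan.pdf", "scan_ocr.pdf", languages=("chi_sim", "eng")) and then point PdfLoader at scan_ocr.pdf.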