Open tcy6 opened 6 months ago
Please download the pdf parser related checkpoints in modelscope [https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files]
Please download the pdf parser related checkpoints in modelscope [https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files]
好的十分感谢,另外是不是Qanything无法处理没有文本元素的pdf啊,我截了一张图进行解析,发现有报错。如果是这样那它里面的ocr的意义是什么呢,是解析表格?
Please download the pdf parser related checkpoints in modelscope [https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files]
好的十分感谢,另外是不是Qanything无法处理没有文本元素的pdf啊,我截了一张图进行解析,发现有报错。如果是这样那它里面的ocr的意义是什么呢,是解析表格?
报错信息如下:
<Logger debug_logger (INFO)> <Logger qa_logger (INFO)>
LOCAL DATA PATH: c:\Users\Administrator\Desktop\QAnything-1.4.1\QANY_DB\content
LOCAL_RERANK_REPO: netease-youdao/bce-reranker-base_v1
LOCAL_EMBED_REPO: netease-youdao/bce-embedding-base_v1
table model initing...
cpu
table model inited...
WARNING:root:Miss outlines
INFO:debug_logger:Start OCR!
1it [00:00, ?it/s]
INFO:debug_logger:OCR finished in 0.15695199999026954 seconds
preprocess
1it [00:00, ?it/s]
Traceback (most recent call last):
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 204, in
Please download the pdf parser related checkpoints in modelscope [https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files]
好的十分感谢,另外是不是Qanything无法处理没有文本元素的pdf啊,我截了一张图进行解析,发现有报错。如果是这样那它里面的ocr的意义是什么呢,是解析表格? The OCR module was removed due to slowly processing speed , and it can currently only handle parseable pdf files. Support for scanning image-based pdf files will be added in the future through a toggle switch.
Please download the pdf parser related checkpoints in modelscope [https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files]
好的十分感谢,另外是不是Qanything无法处理没有文本元素的pdf啊,我截了一张图进行解析,发现有报错。如果是这样那它里面的ocr的意义是什么呢,是解析表格? The OCR module was removed due to slowly processing speed , and it can currently only handle parseable pdf files. Support for scanning image-based pdf files will be added in the future through a toggle switch.
好的好的,十分感谢。既然不会ocr pdf,那感觉可以把pdf loader里面的ocr相关的东西先去掉,不然很迷惑人哈哈哈,明明都输出ocr finished了,但是实际上却没有ocr
同感 上传一个单层PDF只有图片 就悲剧了 box找不到 直接报错 跟代码发现没有OCR
Please download the pdf parser related checkpoints in modelscope [https://www.modelscope.cn/models/netease-youdao/QAnything-pdf-parser/files]
好的十分感谢,另外是不是Qanything无法处理没有文本元素的pdf啊,我截了一张图进行解析,发现有报错。如果是这样那它里面的ocr的意义是什么呢,是解析表格?
报错信息如下: <Logger debug_logger (INFO)> <Logger qa_logger (INFO)> LOCAL DATA PATH: c:\Users\Administrator\Desktop\QAnything-1.4.1\QANY_DB\content LOCAL_RERANK_REPO: netease-youdao/bce-reranker-base_v1 LOCAL_EMBED_REPO: netease-youdao/bce-embedding-base_v1 table model initing... cpu table model inited... WARNING:root:Miss outlines INFO:debug_logger:Start OCR! 1it [00:00, ?it/s] INFO:debug_logger:OCR finished in 0.15695199999026954 seconds preprocess 1it [00:00, ?it/s] Traceback (most recent call last): File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 204, in markdown_dir = loader.load_to_markdown() File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 53, in load_to_markdown page_width = max([b["x1"] for b in self.boxes if b['layout_type'] == 'text']) - min( ValueError: max() arg is an empty sequence
我也是一样的错误:Error in Powerful PDF parsing: max() arg is an empty sequence。关键是我传的是一页论文pdf,不是图片
同感 上传一个单层PDF只有图片 就悲剧了 box找不到 直接报错 跟代码发现没有OCR
可以使用ocrmypdf 处理pdf。
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
当前行为 | Current Behavior
我在self_pdf_loader.py在添加了这么几行代码,用来测试解析pdf的效果 file_path = 'cat.pdf' file_path = os.path.abspath(os.path.join(os.path.dirname(file), file_path)) loader = PdfLoader(filename=file_path, from_page=14, to_page=15, root_dir=os.path.dirname(file_path)) markdown_dir = loader.load_to_markdown() docs = convert_markdown_to_langchaindoc(markdown_dir) docs = PdfLoader.pdf_process(docs) print(docs)
但是却碰到了检索不到checkpoints的问题 Traceback (most recent call last): File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\core\test.py", line 203, in
loader = PdfLoader(filename=file_path, root_dir=os.path.dirname(file_path))
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\self_pdf_loader.py", line 14, in init
super().init()
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\parser\pdf_parser.py", line 34, in init
self.layouter = LayoutRecognizer("layout")
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\layout_recognizer.py", line 20, in init
super().init(self.labels, domain, model_dir)
File "c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel\utils\loader\pdf_to_markdown\core\vision\recognizer.py", line 21, in init
raise ValueError("not find model file path {}".format(
ValueError: not find model file path c:\Users\Administrator\Desktop\QAnything-1.4.1\qanything_kernel/utils/loader/pdf_to_markdown\checkpoints/layout\layout.onnx
期望行为 | Expected Behavior
No response
运行环境 | Environment
QAnything日志 | QAnything logs
No response
复现方法 | Steps To Reproduce
No response
备注 | Anything else?
No response