opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
17.96k stars 1.29k forks

AttributeError: 'tuple' object has no attribute 'shape' #1037

Open xuhongtian opened 1 day ago

xuhongtian commented 1 day ago

Description of the bug

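When a PDF consists entirely of images, it does not seem to be detected or recognized. For example, running small_ocr.pdf from the demo directory fails with AttributeError: 'tuple' object has no attribute 'shape', whereas PDFs such as demo1 or demo2 are processed without problems.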

How to reproduce the bug

The error appears every time I run small_ocr.pdf; demo1.pdf runs without problems. I am testing with magic_pdf_parse_main.py.

Test PDF: 水洗和水洗机.pdf

2024-11-20 15:35:44.333 | INFO | magic_pdf.model.pdf_extract_kit:__init__:137 - DocAnalysis init done!
2024-11-20 15:35:44.333 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:131 - model init cost: 11.925917387008667
2024-11-20 15:35:54.142 | INFO | magic_pdf.model.pdf_extract_kit:__call__:153 - layout detection time: 8.82
2024-11-20 15:35:56.564 | INFO | magic_pdf.model.pdf_extract_kit:__call__:161 - mfd time: 2.41
2024-11-20 15:35:56.565 | INFO | magic_pdf.model.pdf_extract_kit:__call__:168 - formula nums: 0, mfr time: 0.0
2024-11-20 15:35:56.566 | ERROR | __main__:pdf_parse_main:140 - 'tuple' object has no attribute 'shape'
Traceback (most recent call last):

File "/data/xht_test_code/MinerU-master/magic_pdf_parse_main.py", line 146, in <module>
    pdf_parse_main(pdf_path)
    │              └ './demo/small_ocr.pdf'
    └ <function pdf_parse_main at 0x72ecdc925ab0>

File "/data/xht_test_code/MinerU-master/magic_pdf_parse_main.py", line 123, in pdf_parse_main
    pipe.pipe_analyze() # parse
    │    └ <function UNIPipe.pipe_analyze at 0x72ecdc924ee0>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x72ecdcaebe20>

File "/data/xht_test_code/MinerU-master/magic_pdf/pipe/UNIPipe.py", line 37, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
    │    │            │           │    └ b'%PDF-1.7\r\n%\xa1\xb3\xc5\xd7\r\n1 0 obj\r\n<</Pages 2 0 R /Type/Catalog>>\r\nendobj\r\n2 0 obj\r\n<</Count 8/Kids[ 4 0 R ...
    │    │            │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x72ecdcaebe20>
    │    │            └ <function doc_analyze at 0x72ececdf04c0>
    │    └ []
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x72ecdcaebe20>

File "/data/xht_test_code/MinerU-master/magic_pdf/model/doc_analyze_by_custom_model.py", line 166, in doc_analyze
    result = custom_model(img)
             │            └ array([[[255, 255, 255],
             │                      [255, 255, 255],
             │                      [255, 255, 255],
             │                      ...,
             │                      [255, 255, 255],
             │                      [255...
             └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x72ecdcaebfa0>

File "/data/xht_test_code/MinerU-master/magic_pdf/model/pdf_extract_kit.py", line 186, in __call__
    ocr_res = self.ocr_model.ocr(new_image, mfd_res=adjusted_mfdetrec_res)[0]
              │    │         │   │                  └ []
              │    │         │   └ array([[[255, 255, 255],
              │    │         │             [255, 255, 255],
              │    │         │             [255, 255, 255],
              │    │         │             ...,
              │    │         │             [255, 255, 255],
              │    │         │             [255...
              │    │         └ <function ModifiedPaddleOCR.ocr at 0x72ec5b4eeb90>
              │    └ <magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_273_mod.ModifiedPaddleOCR object at 0x72ec4b47fd00>
              └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x72ecdcaebfa0>

File "/data/xht_test_code/MinerU-master/magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py", line 67, in ocr
    img = preprocess_image(img)
          │                └ (array([[[255, 255, 255],
          │                           [255, 255, 255],
          │                           [255, 255, 255],
          │                           ...,
          │                           [255, 255, 255],
          │                           [25...
          └ <function ModifiedPaddleOCR.ocr.<locals>.preprocess_image at 0x72ec3bb07e20>

File "/data/xht_test_code/MinerU-master/magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py", line 57, in preprocess_image
    _image = alpha_to_color(_image, alpha_color)
             │              │       └ (255, 255, 255)
             │              └ (array([[[255, 255, 255],
             │                         [255, 255, 255],
             │                         [255, 255, 255],
             │                         ...,
             │                         [255, 255, 255],
             │                         [25...
             └ <function alpha_to_color at 0x72ec5b4ec700>

File "/data/anaconda3/envs/embeding_env/lib/python3.10/site-packages/paddleocr/ppocr/utils/utility.py", line 107, in alpha_to_color
    if len(img.shape) == 3 and img.shape[2] == 4:
           │                   └ (array([[[255, 255, 255],
           │                              [255, 255, 255],
           │                              [255, 255, 255],
           │                              ...,
           │                              [255, 255, 255],
           │                              [25...
           └ (array([[[255, 255, 255],
                      [255, 255, 255],
                      [255, 255, 255],
                      ...,
                      [255, 255, 255],
                      [25...

AttributeError: 'tuple' object has no attribute 'shape'
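
The traceback shows that the value reaching alpha_to_color is a tuple whose first element is the image array, so the img.shape lookup fails. The following is a minimal sketch of that failure mode only. FakeImage, unwrap_image, and the two extra tuple fields are hypothetical names invented for illustration; they are not part of MinerU or PaddleOCR, and the thread does not establish why a tuple is passed in the first place.

```python
from dataclasses import dataclass


@dataclass
class FakeImage:
    """Hypothetical stand-in for an image array; only `shape` matters here."""
    shape: tuple


def needs_alpha_flatten(img) -> bool:
    # Same condition as the failing line in the traceback (utility.py, line 107).
    return len(img.shape) == 3 and img.shape[2] == 4


def unwrap_image(value):
    """Hypothetical guard: if a tuple arrives, assume its first element is the image."""
    if isinstance(value, tuple):
        return value[0]
    return value


image = FakeImage(shape=(32, 32, 3))
# Assumption: the image arrives wrapped in a tuple; the two flags are made up.
wrapped = (image, False, False)

try:
    needs_alpha_flatten(wrapped)
except AttributeError as exc:
    error_message = str(exc)  # same message as in the report

# After unwrapping, the shape check runs normally (3 channels, so no alpha).
result = needs_alpha_flatten(unwrap_image(wrapped))
```

A guard like this only hides the symptom. If the tuple comes from an installed library version that differs from the one the project expects, aligning the versions is the cleaner route.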

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cpu

myhloli commented 1 day ago

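Paddle or PaddleOCR is probably not installed correctly. It works fine on my side. You could set up a clean environment and check whether PaddleOCR runs there on its own.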

xuhongtian commented 1 day ago

Should I install them with pip install paddlepaddle paddleocr?

myhloli commented 1 day ago

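The versions our project uses are slightly different, but you can simply install it following the official Paddle instructions and check whether it works.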

xuhongtian commented 1 day ago

I have paddleocr 2.9.1 and paddlepaddle 2.6.2 installed. Which Paddle versions do you recommend?

myhloli commented 1 day ago

paddleocr 2.7.3 paddlepaddle 3.0.0b1

xuhongtian commented 1 day ago

OK, thanks. I'll give it a try.
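
To compare an environment against the versions named in this thread (paddleocr 2.7.3 and paddlepaddle 3.0.0b1), a small check along these lines may help. It is a sketch that relies only on the standard library's importlib.metadata. The find_mismatches helper and the RECOMMENDED mapping are names made up for this example, and the check assumes an exact version match is what is wanted.

```python
from importlib import metadata

# Versions the maintainer listed in this thread.
RECOMMENDED = {"paddleocr": "2.7.3", "paddlepaddle": "3.0.0b1"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (installed_or_None, expected)} for each package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (installed, wanted) in find_mismatches(RECOMMENDED).items():
        print(f"{name}: installed={installed}, recommended={wanted}")
```

With the versions reported earlier in the thread (paddleocr 2.9.1, paddlepaddle 2.6.2), both packages would be listed as mismatches.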