opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
17.96k stars 1.29k forks

[Reproducible] Error: pymupdf.mupdf.FzErrorSyntax: code=8: Failed to decode JPX image #1034

Closed CocoaML closed 1 day ago

CocoaML commented 1 day ago

Description of the bug

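[Reproducible] Error: pymupdf.mupdf.FzErrorSyntax: code=8: Failed to decode JPX image part_4.pdf

Running this file on the official website fails with the same error. Official website: https://opendatalab.com/OpenSourceTools/Extractor

Screenshot of the error on the official website: ![Uploading MinerU异常-1.png…]()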

How to reproduce the bug

Error log:

Traceback (most recent call last):
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "/usr/mineru_test/mineru_project/mineru_project_python/minerU/magic_pdf_parse_main.py", line 151, in <module>
    pdf_parse_main(pdf_path, output_dir="./out")
  File "/usr/mineru_test/mineru_project/mineru_project_python/minerU/magic_pdf_parse_main.py", line 108, in pdf_parse_main
    pipe.pipe_classify()
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/pipe/AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 331, in pdf_meta_scan
    image_info_per_page, junk_img_bojids = get_image_info(doc, page_width_pts, page_height_pts)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 119, in get_image_info
    page_result = process_image(page, junk_img_bojids)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 39, in process_image
    recs = page.get_image_rects(img, transform=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/utils.py", line 879, in get_image_rects
    pix = pymupdf.Pixmap(page.parent, xref)  # make pixmap of the image to compute MD5
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/__init__.py", line 10110, in __init__
    img = mupdf.pdf_load_image(pdf, ref)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/mupdf.py", line 50901, in pdf_load_image
    return _mupdf.pdf_load_image(doc, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pymupdf.mupdf.FzErrorSyntax: code=8: Failed to decode JPX image

Operating system

Linux

Python version

3.11

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

CocoaML commented 1 day ago

Using the official website: https://opendatalab.com/OpenSourceTools/Extractor

Screenshot of the error: ![Uploading MinerU异常-1.png…]()

Error message: err: "task_id": "ed69264-29c6-4280-ae1c-3cc334771fd", "state": -1, "data": ("result": ''3, "msg": "infer task led69264-29c6-4280-ae1c-3cc334771fd failed. Err: Infer failed.File analysis failed. Analyse failed. code=8: Failed to decode JPX image"}

myhloli commented 1 day ago

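Page 4 of your PDF has a problem: it renders as a blank page in a browser, probably because it contains a corrupted image. I suggest removing page 4 (by cropping the PDF) and then parsing the file again.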
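For reference, a minimal sketch of that workaround using PyMuPDF: keep every page except the bad one and save the result as a new file. The helper names `pages_to_keep` and `drop_pages` and the output file name are made up for illustration; `Document.select` and `Document.save` are PyMuPDF calls. Whether the trimmed file then parses cleanly in MinerU has not been confirmed in this thread.

```python
from typing import Iterable, List


def pages_to_keep(page_count: int, bad_pages_1based: Iterable[int]) -> List[int]:
    """Return the 0-based indices of all pages except the listed (1-based) bad ones."""
    bad = {p - 1 for p in bad_pages_1based}
    return [i for i in range(page_count) if i not in bad]


def drop_pages(src_path: str, dst_path: str, bad_pages_1based: Iterable[int]) -> None:
    """Write a copy of src_path to dst_path without the bad pages (hypothetical helper)."""
    import pymupdf  # PyMuPDF, imported lazily so pages_to_keep works without it

    doc = pymupdf.open(src_path)
    try:
        keep = pages_to_keep(doc.page_count, bad_pages_1based)
        doc.select(keep)  # keep only these pages, in this order
        # garbage=3 asks PyMuPDF to drop objects that are no longer referenced,
        # which should include the image of the removed page.
        doc.save(dst_path, garbage=3)
    finally:
        doc.close()
```

For the file in this issue, the call would look like `drop_pages("part_4.pdf", "part_4_trimmed.pdf", [4])`, after which the trimmed file is passed to MinerU instead of the original.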

CocoaML commented 1 day ago

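> Page 4 of your PDF has a problem: it renders as a blank page in a browser, probably because it contains a corrupted image. I suggest removing page 4 (by cropping the PDF) and then parsing the file again.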

Thank you for the feedback.

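When a PDF contains a problematic page, could a future version skip the pages that cannot be parsed and still finish converting the rest of the PDF to Markdown? Is there a way to implement this? Looking forward to your reply.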
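One possible shape for such a pre-check, purely as an illustration and not something MinerU currently does: try to decode every image on each page, and treat any page where decoding raises as a page to skip. The function names `split_pages_by_decodability` and `scan_pdf` are hypothetical. The PyMuPDF calls (`Page.get_images`, `pymupdf.Pixmap(doc, xref)`) are the same kind of call that fails in the traceback above. Catching a broad `Exception` is an assumption made to keep the sketch short.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def split_pages_by_decodability(
    page_xrefs: Dict[int, Iterable[int]],
    decode: Callable[[int], object],
) -> Tuple[List[int], List[int]]:
    """Split page numbers into (good, bad); a page is bad if any of its images fails to decode."""
    good: List[int] = []
    bad: List[int] = []
    for page_no in sorted(page_xrefs):
        try:
            for xref in page_xrefs[page_no]:
                decode(xref)  # expected to raise if the image is corrupted
        except Exception:
            bad.append(page_no)
        else:
            good.append(page_no)
    return good, bad


def scan_pdf(path: str) -> Tuple[List[int], List[int]]:
    """Return 0-based (good_pages, bad_pages) for a PDF file (hypothetical helper)."""
    import pymupdf  # PyMuPDF, imported lazily

    doc = pymupdf.open(path)
    try:
        # The first item of each get_images() entry is the image xref.
        page_xrefs = {
            page.number: [img[0] for img in page.get_images(full=True)]
            for page in doc
        }
        return split_pages_by_decodability(
            page_xrefs, lambda xref: pymupdf.Pixmap(doc, xref)
        )
    finally:
        doc.close()
```

The returned `good_pages` list could then be used to build a trimmed copy of the PDF before parsing, so that only the pages with undecodable images are lost.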

myhloli commented 1 day ago

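The parsing library we use is a third-party one, so a request for logic that skips a failing page is best raised at https://github.com/pymupdf/PyMuPDF. On our side, all we can currently do is catch this exception with try/catch and skip parsing of the current document.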
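A rough sketch of that document-level skip on the caller's side, assuming a batch of files and a per-file parse function. `parse_all` is a hypothetical helper, and `parse_one` stands for whatever entry point the caller already has, for example the `pdf_parse_main` function from the reporter's own script. A broad `except Exception` is used here as an assumption, so that a PyMuPDF decoding error does not abort the whole batch.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def parse_all(
    paths: Iterable[str],
    parse_one: Callable[[str], object],
) -> Tuple[Dict[str, object], List[Tuple[str, str]]]:
    """Parse each file; on failure, record the error and move on to the next file."""
    results: Dict[str, object] = {}
    skipped: List[Tuple[str, str]] = []
    for path in paths:
        try:
            results[path] = parse_one(path)
        except Exception as exc:
            # Keep the reason so that skipped documents can be inspected later.
            skipped.append((path, f"{type(exc).__name__}: {exc}"))
    return results, skipped
```

With this shape, a file such as part_4.pdf ends up in `skipped` together with the "Failed to decode JPX image" message, while the other files in the batch are still converted.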