pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

When I use MinerU, error: pymupdf.mupdf.FzErrorSyntax: code=8: Failed to decode JPX image. #4066

Closed CocoaML closed 6 days ago

CocoaML commented 6 days ago

Description of the bug

MinerU Error

The PDF is: part_4.pdf

How can I skip the failing page and continue with the next page? Thanks.

How to reproduce the bug

log:

Traceback (most recent call last):
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/runpy.py", line 198, in _run_module_as_main
return _run_code(code, main_globals, None,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/runpy.py", line 88, in _run_code
exec(code, run_globals)
File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
cli.main()
File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
run()
File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
runpy.run_path(target, run_name="__main__")
File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
_run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
exec(code, run_globals)
File "/usr/mineru_test/mineru_project/mineru_project_python/minerU/magic_pdf_parse_main.py", line 151, in <module>
pdf_parse_main(pdf_path, output_dir="./out")
File "/usr/mineru_test/mineru_project/mineru_project_python/minerU/magic_pdf_parse_main.py", line 108, in pdf_parse_main
pipe.pipe_classify()
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
self.pdf_type = AbsPipe.classify(self.pdf_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/pipe/AbsPipe.py", line 63, in classify
pdf_meta = pdf_meta_scan(pdf_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 331, in pdf_meta_scan
image_info_per_page, junk_img_bojids = get_image_info(doc, page_width_pts, page_height_pts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 119, in get_image_info
page_result = process_image(page, junk_img_bojids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 39, in process_image
recs = page.get_image_rects(img, transform=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/utils.py", line 879, in get_image_rects
pix = pymupdf.Pixmap(page.parent, xref) # make pixmap of the image to compute MD5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/__init__.py", line 10110, in __init__
img = mupdf.pdf_load_image(pdf, ref)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/mupdf.py", line 50901, in pdf_load_image
return _mupdf.pdf_load_image(doc, obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pymupdf.mupdf.FzErrorSyntax: code=8: Failed to decode JPX image

How can I skip the failing page and continue with the next page? Looking forward to your reply. Thanks.

PyMuPDF version

1.24.14

Operating system

Linux

Python version

3.11

JorjMcKie commented 6 days ago

The image error is only reported by this message, processing continues: your script is not ended by an exception.

CocoaML commented 6 days ago

JorjMcKie wrote: "The image error is only reported by this message, processing continues: your script is not ended by an exception."

Thank you for your feedback.

Processing did stop on the faulty PDF page, and an exception was raised. The error log is:

File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/utils.py", line 879, in get_image_rects
pix = pymupdf.Pixmap(page.parent, xref) # make pixmap of the image to compute MD5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/__init__.py", line 10110, in __init__
img = mupdf.pdf_load_image(pdf, ref)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/mineru_test/python/miniforge3/envs/mineru_project_py311/lib/python3.11/site-packages/pymupdf/mupdf.py", line 50901, in pdf_load_image
return _mupdf.pdf_load_image(doc, obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pymupdf.mupdf.FzErrorSyntax: code=8: Failed to decode JPX image

Looking forward to further communication with you.

JorjMcKie commented 6 days ago

Moving this to "Discussions", as there is no PyMuPDF/MuPDF problem.
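
The traceback shows the exception leaving page.get_image_rects(img, transform=True) in the calling code, so one way to skip an undecodable image and carry on with the remaining images and pages is to catch the error per image in the caller. The following is a minimal sketch under that assumption, not MinerU's or PyMuPDF's own handling. The helper names safe_image_rects and scan_document are hypothetical. The page objects are expected to behave like pymupdf.Page (only get_images and get_image_rects are used), and a broad except Exception is used on the assumption that pymupdf.mupdf.FzErrorSyntax derives from Exception.

```python
import logging

logger = logging.getLogger(__name__)


def safe_image_rects(page):
    """Collect image rectangles for one page, skipping images that fail to load.

    `page` is assumed to behave like a pymupdf.Page. Returns (rects, errors),
    where errors is a list of (xref, message) for images that raised.
    """
    rects, errors = [], []
    for img in page.get_images(full=True):
        try:
            # With transform=True each entry is a (rect, matrix) pair.
            rects.extend(page.get_image_rects(img, transform=True))
        except Exception as exc:  # e.g. "Failed to decode JPX image"
            xref = img[0]  # first field of a get_images() item is the xref
            logger.warning("skipping image xref=%s: %s", xref, exc)
            errors.append((xref, str(exc)))
    return rects, errors


def scan_document(doc):
    """Run safe_image_rects over every page; never abort on a bad image.

    Returns (results, failed): rectangles per page number, and the
    per-page error lists for pages that had at least one bad image.
    """
    results, failed = {}, {}
    for page_number, page in enumerate(doc):
        rects, errors = safe_image_rects(page)
        results[page_number] = rects
        if errors:
            failed[page_number] = errors
    return results, failed
```

With PyMuPDF installed, this could be called as scan_document(pymupdf.open("part_4.pdf")). The failed mapping then lists which pages contained an image that could not be decoded, so those pages can be reviewed or excluded later instead of stopping the whole run.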