opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
10.84k stars 797 forks

pymupdf.mupdf.FzErrorFormat: code=7: object out of range (0 0 R); xref size 161 #167

Closed shutter-cp closed 1 month ago

shutter-cp commented 1 month ago

Description of the bug

Traceback (most recent call last):
  File "/root/miniconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 325, in pdf_command
    do_parse(
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 120, in do_parse
    draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/libs/draw_bbox.py", line 142, in draw_layout_bbox
    pdf_docs.save(f"{out_path}/layout.pdf")
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/__init__.py", line 5444, in save
    mupdf.pdf_save_document(pdf, filename, opts)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/mupdf.py", line 50563, in pdf_save_document
    return _mupdf.pdf_save_document(doc, filename, opts)
pymupdf.mupdf.FzErrorFormat: code=7: object out of range (0 0 R); xref size 161
(MinerU) [root@n01v PDF-Extract]# magic-pdf --version
magic-pdf, version 0.6.1

How to reproduce the bug

magic-pdf pdf-command --pdf ./test2.pdf --inside_model true

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cpu

myhloli commented 1 month ago

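This may be caused by a problem in the PDF file itself that makes the dependency pymupdf fail when it saves the visualization file. Could you upload a copy of the PDF that triggers the error so we can test it?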

shutter-cp commented 1 month ago

test2.pdf

myhloli commented 1 month ago

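I retested and confirmed that the unusual structure of this PDF causes the pymupdf library to raise an exception when it writes out the file. The Markdown parsing itself works fine. As a temporary workaround for this and similar cases, you can follow https://github.com/opendatalab/MinerU/blob/master/demo/demo.py and run the parsing through the API instead. Parsing result attached: test2.zip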

If you run into any other problems while using the tool, feel free to keep reporting them.
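
For anyone hitting the same traceback: the failure happens in `pdf_docs.save` inside `draw_layout_bbox`, which only writes the layout visualization, while the Markdown parsing succeeds. One possible way to keep a script running in that situation is to treat the visualization as a non-fatal step. The following is a minimal sketch, not part of magic-pdf: `run_optional_step` is a hypothetical helper, and it assumes the pymupdf error is an ordinary subclass of Python's `Exception`.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def run_optional_step(step: Callable[[], None], description: str) -> bool:
    """Run a non-essential step and report whether it succeeded.

    Hypothetical helper for illustration only. Any exception raised by
    `step` is logged as a warning and swallowed, so the caller can go on
    to write the main output (for example the Markdown result).
    """
    try:
        step()
    except Exception as exc:  # assumption: the pymupdf error derives from Exception
        logger.warning("Skipping %s: %s", description, exc)
        return False
    return True
```

In a custom script based on the demo linked above, the visualization call could be wrapped as `run_optional_step(lambda: draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir), "layout visualization")`, using the names that appear in the traceback. Catching `Exception` broadly is a deliberate trade-off here: it keeps the sketch independent of pymupdf's exception hierarchy, at the cost of also hiding unrelated errors in the wrapped step, so the logged warning should be checked.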