opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports extraction from PDFs, webpages, and multi-format e-books.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0

Failed to process 54.pdf: 'PDFObjRef' object is not iterable #198

Closed by Lincyaw 1 month ago

Lincyaw commented 1 month ago

Description of the bug

The script is the same as the one in this issue; this PDF input triggers the error:

2024-07-23 14:50:10.631 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:116 - doc analyze cost: 21.791197299957275
2024-07-23 14:50:11.331 | WARNING  | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功:{'bbox': [497, 444, 504, 453], 'sco
2024-07-23 14:50:12.614 | INFO     | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:48 - uni_pipe mk mm_markdown finished
2024-07-23 14:50:12.614 | INFO     | __main__:process_pdf_file:41 - Processed '13.pdf' and generated '13.md'
root@034cece9ab64:/code# python main.py
2024-07-23 14:50:48.146 | ERROR    | __main__:process_pdf_file:43 - Failed to process 54.pdf: 'PDFObjRef' object is not iterable
Traceback (most recent call last):

  File "/code/main.py", line 58, in <module>
    process_pdf_files_in_directory(directory_path)
    │                              └ 'papers'
    └ <function process_pdf_files_in_directory at 0x7d2ac60936d0>

  File "/code/main.py", line 50, in process_pdf_files_in_directory
    process_pdf_file(directory, pdf_file)
    │                │          └ '54.pdf'
    │                └ 'papers'
    └ <function process_pdf_file at 0x7d2ac62fbd90>

> File "/code/main.py", line 28, in process_pdf_file
    pipe.pipe_classify()
    │    └ <function UNIPipe.pipe_classify at 0x7d2a42ebe290>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>

  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
    │    │          │       │        │    └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344
    │    │          │       │        └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
    │    │          │       └ <staticmethod(<function AbsPipe.classify at 0x7d2a8d5a4c10>)>
    │    │          └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>
    │    └ ''
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7d2ac3c4fa00>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/pipe/AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
               │             └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
               └ <function pdf_meta_scan at 0x7d2a8d5a40d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 339, in pdf_meta_scan
    invalid_chars = check_invalid_chars(pdf_bytes)
                    │                   └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >
                    └ <function check_invalid_chars at 0x7d2a8d5a4040>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/filter/pdf_meta_scan.py", line 305, in check_invalid_chars
    return detect_invalid_chars(pdf_bytes)
           │                    └ b'%PDF-1.5\n%\xbf\xf7\xa2\xfe\n835 0 obj\n<< /Linearized 1 /L 1821626 /H [ 4464 431 ] /O 839 /E 76405 /N 12 /T 1816344 >>\nen...
           └ <function detect_invalid_chars at 0x7d2a8d59fac0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf-0.6.1-py3.10.egg/magic_pdf/libs/pdf_check.py", line 44, in detect_invalid_chars
    text = extract_text(sample_pdf_file_like_object)
           │            └ <_io.BytesIO object at 0x7d2a42e6ff10>
           └ <function extract_text at 0x7d2a9299e4d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
                │       └ <classmethod(<function PDFPage.get_pages at 0x7d2a8d58e170>)>
                └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
                                    │   │            └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
                                    │   └ <classmethod(<function PDFPage.create_pages at 0x7d2a8d58e0e0>)>
                                    └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
          │   │         │      │          └ repeat(None)
          │   │         │      └ {'Type': /'Page', 'Contents': [<PDFObjRef:3>], 'Resources': <PDFObjRef:4>, 'MediaBox': <PDFObjRef:26>, 'Annots': [<PDFObjRef:...
          │   │         └ 27
          │   └ <pdfminer.pdfdocument.PDFDocument object at 0x7d2a42eb1990>
          └ <class 'pdfminer.pdfpage.PDFPage'>
  File "/opt/mineru_venv/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
                     │    └ typing.Any
                     └ typing.List

TypeError: 'PDFObjRef' object is not iterable
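
In the `create_pages` frame above, the page dictionary shows `'MediaBox': <PDFObjRef:26>`, meaning the page stores its media box as an indirect reference rather than an inline array. The `TypeError` is consistent with code that iterates over that reference before resolving it. The following is a minimal, self-contained sketch of that failure shape and of the resolve-first pattern. `FakeObjRef`, `resolve_value`, and `read_box` are hypothetical names for illustration only; this is not pdfminer.six's code.

```python
from typing import Any, List


class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF reference (not pdfminer's class)."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def resolve(self) -> Any:
        return self._target


def resolve_value(value: Any) -> Any:
    """Follow indirect references until a concrete value is reached."""
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


def read_box(attrs: dict, key: str = "MediaBox") -> List[float]:
    """Resolve the box itself first, then each of its entries."""
    box = resolve_value(attrs[key])
    return [float(resolve_value(item)) for item in box]


# Assumed example data: the box and one of its entries are indirect references.
page_attrs = {"MediaBox": FakeObjRef([0, 0, FakeObjRef(612), 792])}

try:
    # Iterating the reference directly fails, like the traceback above.
    [item for item in page_attrs["MediaBox"]]
except TypeError as exc:
    print(exc)

# Resolving before iterating succeeds.
print(read_box(page_attrs))
```

The point of the sketch is only that any value in a PDF dictionary may be indirect, so a reader has to resolve the container before iterating over it as well as resolving each element.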

How to reproduce the bug

Use this file to reproduce:

54.pdf

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

myhloli commented 1 month ago

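Another user reported this problem earlier: https://github.com/opendatalab/MinerU/issues/191. It is a new bug introduced in the latest release of pdfminer.six. I tested version 20231228 and it works correctly, so I recommend running

pip install pdfminer.six==20231228

to resolve the issue. Fix reference: https://github.com/opendatalab/MinerU/commit/27e98a8130ec7b67b62f9260a7fc72ffe1e481a8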
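
Before a batch run, it may help to confirm which pdfminer.six version is actually installed in the active environment. The sketch below uses the standard library's `importlib.metadata` and compares the result against the version reported as working in this thread. `KNOWN_GOOD`, `installed_pdfminer_version`, and `is_known_good` are illustrative names, not part of MinerU or pdfminer.six.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Version reported as working in this thread.
KNOWN_GOOD = "20231228"


def installed_pdfminer_version() -> Optional[str]:
    """Return the installed pdfminer.six version, or None if it is not installed."""
    try:
        return version("pdfminer.six")
    except PackageNotFoundError:
        return None


def is_known_good(installed: Optional[str]) -> bool:
    """True only when the installed version matches the one tested in this thread."""
    return installed == KNOWN_GOOD


if __name__ == "__main__":
    found = installed_pdfminer_version()
    if is_known_good(found):
        print(f"pdfminer.six {found} matches the version tested in this thread")
    else:
        print(f"pdfminer.six is {found!r}; consider pinning {KNOWN_GOOD}")
```

An exact match is deliberately strict: the thread only confirms that 20231228 works and does not say which other releases are affected.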