opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://mineru.readthedocs.io/
GNU Affero General Public License v3.0
16.11k stars 1.16k forks

magic_pdf.user_api:parse_pdf:97 - string index out of range #972

Open yibie opened 9 hours ago

yibie commented 9 hours ago

Description of the bug

While testing MinerU on a PDF conversion, the following error occurred:

2024-11-15 22:06:31.643 | ERROR    | magic_pdf.user_api:parse_pdf:97 - string index out of range
Traceback (most recent call last):

  File "/opt/homebrew/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │   │    └ <Command cli>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x1057cc1f0>
           └ <Command cli>
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x10552b700>
         │    └ <function Command.invoke at 0x1057ccca0>
         └ <Command cli>
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': '/Volumes/Collect/archives/Indexing The Manual of Good Practice 2013.pdf', 'output_dir': '/Users/chenyibin/Documents...
           │   │      │    │           └ <click.core.Context object at 0x10552b700>
           │   │      │    └ <function cli at 0x137ad32e0>
           │   │      └ <Command cli>
           │   └ <function Context.invoke at 0x1057b79a0>
           └ <click.core.Context object at 0x10552b700>
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': '/Volumes/Collect/archives/Indexing The Manual of Good Practice 2013.pdf', 'output_dir': '/Users/chenyibin/Documents...
                       └ ()
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 115, in cli
    parse_doc(path)
    │         └ '/Volumes/Collect/archives/Indexing The Manual of Good Practice 2013.pdf'
    └ <function cli.<locals>.parse_doc at 0x104e5b910>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 96, in parse_doc
    do_parse(
    └ <function do_parse at 0x137ad24d0>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 93, in do_parse
    pipe.pipe_parse()
    │    └ <function UNIPipe.pipe_parse at 0x137ad2c20>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x137ad8940>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 44, in pipe_parse
    self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
    │    │              │               │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x137ad8730>
    │    │              │               │    │          │    │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x137ad8940>
    │    │              │               │    │          │    └ [{'layout_dets': [{'category_id': 1, 'poly': [235.5786895751953, 856.4910888671875, 520.361328125, 856.4910888671875, 520.361...
    │    │              │               │    │          └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x137ad8940>
    │    │              │               │    └ b'%PDF-1.6\n%\xe2\xe3\xcf\xd3\n1 0 obj<</Height 5/BitsPerComponent 8/Width 5/Type/XObject/Subtype/Image/DecodeParms<</Colors ...
    │    │              │               └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x137ad8940>
    │    │              └ <function parse_union_pdf at 0x137ad20e0>
    │    └ None
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x137ad8940>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/user_api.py", line 100, in parse_union_pdf
    pdf_info_dict = parse_pdf(parse_pdf_by_txt)
                    │         └ <function parse_pdf_by_txt at 0x137ad1fc0>
                    └ <function parse_union_pdf.<locals>.parse_pdf at 0x327e1b400>
> File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/user_api.py", line 88, in parse_pdf
    return method(
           └ <function parse_pdf_by_txt at 0x137ad1fc0>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 15, in parse_pdf_by_txt
    return pdf_parse_union(dataset,
           │               └ <magic_pdf.data.dataset.PymuDocDataset object at 0x3205caec0>
           └ <function pdf_parse_union at 0x137ad1f30>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 630, in pdf_parse_union
    para_split(pdf_info_dict, debug_mode=debug_mode)
    │          │                         └ True
    │          └ {'page_0': {'preproc_blocks': [{'type': 'title', 'bbox': [85, 84, 349, 107], 'lines': [{'bbox': [85, 84, 349, 107], 'spans': ...
    └ <function para_split at 0x13758f1c0>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 309, in para_split
    __para_merge_page(all_blocks)
    │                 └ [{'type': 'title', 'bbox': [85, 84, 349, 107], 'lines': [{'bbox': [85, 84, 349, 107], 'spans': [], 'index': 0}], 'index': 0, ...
    └ <function __para_merge_page at 0x13758f130>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 272, in __para_merge_page
    block_type = __is_list_or_index_block(block)
                 │                        └ {'type': 'text', 'bbox': [216, 56, 369, 573], 'lines': [{'bbox': [222.41268920898438, 59.88707733154297, 300.083251953125, 68...
                 └ <function __is_list_or_index_block at 0x13758eef0>
  File "/opt/homebrew/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 197, in __is_list_or_index_block
    if lines_text_list[i][0].isdigit():
       │               └ 37
       └ ['bibliography (continued)', 'indexability of, 52', 'Otlet’s note card theory and, 9', 'presence of, and indexability of', 'r...

IndexError: string index out of range
2024-11-15 22:06:31.827 | WARNING  | magic_pdf.user_api:parse_union_pdf:102 - parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr

How to reproduce the bug

Command: magic-pdf -p /Volumes/Collect/archives/Indexing\ The\ Manual\ of\ Good\ Practice\ 2013.pdf -o ~/Documents/temp_convert/ -m auto

Operating system

macOS

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cpu

myhloli commented 9 hours ago

Could you upload the PDF file that triggers the problem?

myhloli commented 9 hours ago

I looked at the code. This appears to be a bug in 0.9.2 that should already be fixed in 0.9.3. Please try upgrading to 0.9.3 and run it again.
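For readers following the traceback, the failing expression is `lines_text_list[i][0].isdigit()`, which raises `IndexError` when the string at index `i` is empty. The sketch below reproduces that failure in isolation and shows one possible guard. It is an illustration only: the helper names `starts_with_digit_unsafe` and `starts_with_digit` and the sample list are hypothetical, and this is not claimed to be the change that shipped in 0.9.3.

```python
def starts_with_digit_unsafe(text: str) -> bool:
    # Mirrors the expression from the traceback: indexing [0] on an
    # empty string raises IndexError.
    return text[0].isdigit()


def starts_with_digit(text: str) -> bool:
    # Slicing never raises: ""[:1] is "", and "".isdigit() is False.
    return text[:1].isdigit()


# Hypothetical sample; the empty string stands in for a line with no text.
lines_text_list = ["bibliography (continued)", "indexability of, 52", "", "9 lives"]

try:
    [starts_with_digit_unsafe(t) for t in lines_text_list]
except IndexError as exc:
    print(f"unsafe check failed: {exc}")  # string index out of range

print([starts_with_digit(t) for t in lines_text_list])  # [False, False, False, True]
```

Slicing with `[:1]` is used here instead of an explicit emptiness check because it keeps the condition a single expression; an `if text and text[0].isdigit()` guard would behave the same way.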

yibie commented 9 hours ago

May I ask how to upgrade?

myhloli commented 8 hours ago

The same way as the install command.
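The thread does not show the install command itself. For a pip-based install, upgrading typically means rerunning it with `-U`, plus whatever extras or index options were used originally (an assumption). To check whether the installed release predates 0.9.3, a small script along these lines may help. It assumes the distribution is registered under the name `magic-pdf`; the helpers `parse_release` and `needs_upgrade` are hypothetical names introduced for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(ver: str) -> Tuple[int, ...]:
    # Keep only the leading numeric part of each dotted piece,
    # e.g. "0.9.3" -> (0, 9, 3) and "0.9.3rc1" -> (0, 9, 3).
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_upgrade(installed: str, fixed: str = "0.9.3") -> bool:
    # True when the installed release is older than the release
    # the maintainer names as containing the fix.
    return parse_release(installed) < parse_release(fixed)


def installed_magic_pdf_version() -> Optional[str]:
    # Assumption: the distribution name is "magic-pdf".
    try:
        return version("magic-pdf")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_magic_pdf_version()
    if current is None:
        print("magic-pdf is not installed in this environment")
    elif needs_upgrade(current):
        print(f"magic-pdf {current} is older than 0.9.3; consider upgrading")
    else:
        print(f"magic-pdf {current} is at or past 0.9.3")
```

The comparison deliberately ignores pre-release suffixes; a project that needs strict version semantics would be better served by a dedicated version-parsing library.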

yibie commented 1 hour ago

After upgrading, everything works. Thanks!