opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.28k stars 845 forks

Cannot restore the document format after parsing to a JSON file with PDF-Extract-Kit #191

Closed Halflifefa closed 1 month ago

Halflifefa commented 1 month ago

Description of the bug

The extracted JSON content cannot be assembled into a new document. The PDF used is the one provided at this link.

How to reproduce the bug

Parse the PDF to JSON

python pdf_extract.py --pdf data/LLMBook.pdf 

Convert the JSON

magic-pdf pdf-command --pdf "data/LLMBook.pdf" --model "output/LLMBook.json"

The error is as follows:

magic-pdf pdf-command --pdf "data/LLMBook.pdf" --model "output/LLMBook.json"
2024-07-23 13:44:20.996 | INFO     | magic_pdf.cli.magicpdf:do_parse:91 - local output dir is /tmp/magic-pdf/LLMBook/auto
Traceback (most recent call last):
  File "/home/miniconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 325, in pdf_command
    do_parse(
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 106, in do_parse
    pipe.pipe_classify()
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pipe/AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 339, in pdf_meta_scan
    invalid_chars = check_invalid_chars(pdf_bytes)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 305, in check_invalid_chars
    return detect_invalid_chars(pdf_bytes)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/libs/pdf_check.py", line 44, in detect_invalid_chars
    text = extract_text(sample_pdf_file_like_object)
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
  File "/home/miniconda3/envs/MinerU/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
TypeError: 'PDFObjRef' object is not iterable

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda
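
The traceback above ends inside pdfminer's `extract_text`, not in magic-pdf's own code, so one way to narrow the problem down is to call pdfminer directly. The following is a minimal sketch under assumptions: `probe` is a hypothetical helper, the path is the one from the reproduction commands, and because magic-pdf passes a sampled copy of the PDF to pdfminer, running on the full file may or may not hit the same error.

```python
import os
from typing import Callable


def probe(pdf_path: str, extractor: Callable[[str], str]) -> str:
    """Run a text extractor on a PDF and summarise what happened.

    `extractor` is injected so the helper can be exercised without
    pdfminer installed; in real use it is pdfminer's extract_text.
    """
    try:
        text = extractor(pdf_path)
    except TypeError as exc:
        # The traceback in this issue ends in a TypeError raised by pdfminer.
        return f"extractor raised TypeError: {exc}"
    return f"ok: extracted {len(text)} characters"


if __name__ == "__main__":
    path = "data/LLMBook.pdf"  # path taken from the reproduction commands
    try:
        from pdfminer.high_level import extract_text
    except ImportError:
        print("pdfminer.six is not installed")
    else:
        if os.path.exists(path):
            print(probe(path, extract_text))
        else:
            print(f"{path} not found; nothing to probe")
```

If the direct call fails with the same `TypeError`, the crash can be attributed to the pdfminer.six dependency rather than to the JSON produced by PDF-Extract-Kit.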

myhloli commented 1 month ago

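I tested the sample PDF. Because of its unusual file format, it crashes the pdfminer dependency. For this sample, you can add --method txt on the command line to skip the program's initial PDF version detection stage (the pipe_classify step in the traceback above).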
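
As a rough illustration of the workaround, the sketch below rebuilds the reporter's command with the suggested `--method txt` option appended. The command, paths, and flag come from this thread; `build_command` is a hypothetical helper, and the script only runs the command if `magic-pdf` is found on `PATH`.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(pdf: str, model: str, method: Optional[str] = None) -> List[str]:
    """Build the magic-pdf invocation used in this thread."""
    cmd = ["magic-pdf", "pdf-command", "--pdf", pdf, "--model", model]
    if method is not None:
        # "--method txt" is the workaround suggested by the maintainer.
        cmd += ["--method", method]
    return cmd


if __name__ == "__main__":
    cmd = build_command("data/LLMBook.pdf", "output/LLMBook.json", method="txt")
    if shutil.which(cmd[0]) is None:
        print("magic-pdf not found on PATH; would run:", " ".join(cmd))
    else:
        # check=False: report the exit status instead of raising on failure.
        result = subprocess.run(cmd, check=False)
        print("magic-pdf exited with status", result.returncode)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.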

Halflifefa commented 1 month ago

OK, that worked. Thanks!

myhloli commented 1 month ago

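@Halflifefa Other users have now reported this problem as well. I confirmed that it only occurs with the latest version of pdfminer.six, so it can be fixed by installing the earlier release:

pip install pdfminer.six==20231228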
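
To check whether an environment already has the pinned release, something like the following sketch could be used. It assumes that pdfminer.six version strings begin with an eight-digit date (as `20231228` does) and that any later release may be affected; the thread itself only says "the latest version", so treat the comparison as a heuristic. `is_newer_than_known_good` is a hypothetical helper.

```python
import re
from importlib import metadata

# Release pinned by the maintainer in this thread.
KNOWN_GOOD = 20231228


def is_newer_than_known_good(version: str) -> bool:
    """Return True if a date-style version string is later than KNOWN_GOOD.

    Assumption: the version starts with eight digits (YYYYMMDD).
    """
    match = re.match(r"\d{8}", version)
    if match is None:
        raise ValueError(f"unexpected pdfminer.six version format: {version!r}")
    return int(match.group()) > KNOWN_GOOD


def main() -> None:
    try:
        installed = metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        print("pdfminer.six is not installed")
        return
    if is_newer_than_known_good(installed):
        print(f"pdfminer.six {installed} is newer than {KNOWN_GOOD}; "
              "consider: pip install pdfminer.six==20231228")
    else:
        print(f"pdfminer.six {installed} is not newer than the pinned release")


if __name__ == "__main__":
    main()
```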