opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.19k stars 835 forks

PDF parsing error: segmentation fault #505

Open audio-github-2020 opened 2 weeks ago

audio-github-2020 commented 2 weeks ago

Description of the bug

Error output:

2024-08-29 16:20:52.704 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 11, cid_chars_radio: 0.0
2024-08-29 16:20:52.705 | WARNING | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: True, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: True, by_img_narrow_strips: True, by_invalid_chars: True
[1] 53755 segmentation fault magic-pdf -p /Users/t/Downloads/attach/test.pdf -o -m auto

test.pdf

How to reproduce the bug

(MinerU) ✘ t@tMAC ~ magic-pdf -p /Users/t/Downloads/attach/test.pdf -o /Users/t/Downloads/attach -m auto
2024-08-29 16:20:52.704 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 11, cid_chars_radio: 0.0
2024-08-29 16:20:52.705 | WARNING | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: True, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: True, by_img_narrow_strips: True, by_invalid_chars: True
[1] 53755 segmentation fault magic-pdf -p /Users/t/Downloads/attach/test.pdf -o -m auto
(MinerU) ✘ t@tMAC ~
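For anyone trying to narrow down where the crash happens, a minimal sketch like the following may help. It reruns a command with Python's standard `faulthandler` enabled through the `PYTHONFAULTHANDLER` environment variable, so a segmentation fault inside a Python process should dump a Python-level traceback to stderr. The helper names `describe_exit` and `run_with_faulthandler` are made up for this example, and it assumes `magic-pdf` runs as an ordinary Python console script that inherits the environment.

```python
import os
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with code {returncode}"


def run_with_faulthandler(cmd: list[str]) -> tuple[str, str]:
    """Run cmd with faulthandler enabled; return (exit description, stderr)."""
    # PYTHONFAULTHANDLER=1 makes the child interpreter call faulthandler.enable()
    # at startup, so a crash prints the Python stack that was active.
    env = dict(os.environ, PYTHONFAULTHANDLER="1")
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return describe_exit(proc.returncode), proc.stderr


# Hypothetical usage with the command from this report:
# status, stderr = run_with_faulthandler(
#     ["magic-pdf", "-p", "/Users/t/Downloads/attach/test.pdf",
#      "-o", "/Users/t/Downloads/attach", "-m", "auto"])
# print(status)
# print(stderr)
```

If the traceback ends inside a native extension module, that would suggest the fault is in a compiled dependency rather than in the Python code itself.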

magic-pdf.json is as follows:
{
  "models-dir": "/Users/t/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models",
  "device-mode": "cpu",
  "table-config": {
    "is_table_recog_enable": false,
    "max_time": 400
  }
}
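Before digging into the crash itself, it can be worth ruling out a configuration problem. The sketch below is only a sanity check, not part of MinerU: it loads a `magic-pdf.json` file and reports whether the keys shown in this report are present and whether `models-dir` points at an existing directory. Treating `models-dir` and `device-mode` as required is an assumption made for this example, and `check_config` and `load_config` are hypothetical helper names.

```python
import json
from pathlib import Path

# Assumption for this sketch: these two keys (taken from the config in this
# report) are treated as required. MinerU itself may behave differently.
REQUIRED_KEYS = ("models-dir", "device-mode")


def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems found in the config dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            problems.append(f"missing key: {key}")
    models_dir = cfg.get("models-dir")
    # Only check the path when the key is present, to avoid a duplicate report.
    if models_dir is not None and not Path(models_dir).is_dir():
        problems.append(f"models-dir is not a directory: {models_dir}")
    return problems


def load_config(path: str) -> dict:
    """Read a JSON config file from the given path."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical usage; the location of magic-pdf.json is an assumption:
# for problem in check_config(load_config(str(Path.home() / "magic-pdf.json"))):
#     print(problem)
```

An empty result only means the file looks structurally plausible; it does not verify that the model files inside `models-dir` are complete.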

Operating system

MacOS

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cpu