opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
13.43k stars, 1.01k forks

magic-pdf, version 0.8.1: PDF parsing error #758

Closed. wertyac closed this issue 3 days ago

wertyac commented 1 week ago

Description of the bug

Python 3.10, installed by following the documentation. The CUDA version is 11.8 and the OS is Ubuntu 22.04. The following error is reported and the PDF cannot be parsed:
2024-10-18 09:32:53.186 | ERROR | magic_pdf.tools.cli:parse_doc:96 - Coordinate 'right' is less than 'left'

How to reproduce the bug

(mineru) health@222server:~$ magic-pdf -p small_ocr.pdf -o /home/health/pdf/
2024-10-18 09:29:54.551 INFO magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 8, cid_chars_radio: 0.0
2024-10-18 09:29:54.553 WARNING magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: False, by_invalid_chars: True
2024-10-18 09:30:02.736 INFO magic_pdf.model.pdf_extract_kit:__init__:180 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: True, apply_table: False
2024-10-18 09:30:02.736 INFO magic_pdf.model.pdf_extract_kit:__init__:188 - using device: cuda
2024-10-18 09:30:02.736 INFO magic_pdf.model.pdf_extract_kit:__init__:190 - using models_dir: /home/health/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[10/18 09:30:19 detectron2]: Rank of current process: 0. World size: 1
[10/18 09:30:19 detectron2]: Environment info:

sys.platform                      linux
Python                            3.10.15 (main, Oct 3 2024, 07:27:34) [GCC 11.2.0]
numpy                             1.26.4
detectron2                        0.6 @/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/detectron2
Compiler                          GCC 11.4
CUDA compiler                     not available
DETECTRON2_ENV_MODULE
PyTorch                           2.3.1+cu121 @/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/torch
PyTorch debug build               False
torch._C._GLIBCXX_USE_CXX11_ABI   False
GPU available                     Yes
GPU 0                             NVIDIA GeForce RTX 4070 (arch=8.9)
Driver version                    535.183.01
CUDA_HOME                         /usr/local/cuda-11.8
Pillow                            11.0.0
torchvision                       0.18.1+cu121 @/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/torchvision
torchvision arch flags            5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 9.0
fvcore                            0.1.5.post20221221
iopath                            0.1.9
cv2                               4.6.0


PyTorch built with:

[10/18 09:30:19 detectron2]: Command line arguments: {'config_file': '/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/home/health/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth']}
[10/18 09:30:19 detectron2]: Contents of args.config_file=/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:
AUG:
  DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: false
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[10/18 09:30:21 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /home/health/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ...
[10/18 09:30:21 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/health/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ...
2024-10-18 09:30:22.596 | INFO | magic_pdf.model.pdf_extract_kit:__init__:248 - DocAnalysis init done!
2024-10-18 09:30:22.596 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 28.04293656349182
2024-10-18 09:30:24.960 | INFO | magic_pdf.model.pdf_extract_kit:__call__:259 - layout detection cost: 1.47

0: 1888x1312 219 embeddings, 81 isolateds, 106.1ms
Speed: 18.1ms preprocess, 106.1ms inference, 42.8ms postprocess per image at shape (1, 3, 1888, 1312)
2024-10-18 09:32:48.691 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 300, mfr time: 132.87
2024-10-18 09:32:52.197 | INFO | magic_pdf.model.pdf_extract_kit:__call__:372 - ocr cost: 3.47
2024-10-18 09:32:52.901 | INFO | magic_pdf.model.pdf_extract_kit:__call__:259 - layout detection cost: 0.7

0: 1888x1312 290 embeddings, 10 isolateds, 105.5ms
Speed: 18.2ms preprocess, 105.5ms inference, 125.3ms postprocess per image at shape (1, 3, 1888, 1312)
2024-10-18 09:32:53.186 | ERROR | magic_pdf.tools.cli:parse_doc:96 - Coordinate 'right' is less than 'left'
Traceback (most recent call last):

  File "/home/health/anaconda3/envs/mineru/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
        ctx.params = {'path': 'small_ocr.pdf', 'output_dir': '/home/health/pdf/', 'method': 'auto', 'debug_able': False, 'start_pageid': 0, 'end...
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 102, in cli
    parse_doc(path)
        path = 'small_ocr.pdf'
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 84, in parse_doc
    do_parse(
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 79, in do_parse
    pipe.pipe_analyze()
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 33, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 129, in doc_analyze
    result = custom_model(img)
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 274, in __call__
    bbox_img = get_croped_image(Image.fromarray(image), [xmin, ymin, xmax, ymax])
        xmin = 2491, ymin = 88, xmax = 2361, ymax = 0
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/model/pek_sub_modules/post_process.py", line 16, in get_croped_image
    croped_img = image_pil.crop((x_min, y_min, x_max, y_max))
        x_min = 2491, y_min = 88, x_max = 2361, y_max = 0
        image_pil = <PIL.Image.Image image mode=RGB size=3405x5000 at 0x7E1CD1172AA0>
  File "/home/health/anaconda3/envs/mineru/lib/python3.10/site-packages/PIL/Image.py", line 1305, in crop
    raise ValueError(msg)
        msg = "Coordinate 'right' is less than 'left'"

ValueError: Coordinate 'right' is less than 'left'
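
For readers following the traceback: the crop fails because the box passed to `Image.crop` has a right edge (2361) smaller than its left edge (2491). The sketch below reproduces that ordering check in plain Python, using the values from the traceback. The helper names `check_crop_box` and `filter_valid_boxes` are hypothetical and are not part of magic-pdf or Pillow. Filtering out such boxes would only hide the symptom; it does not explain why the detector produced them.

```python
def check_crop_box(box):
    """Return None if the box is correctly ordered, else an error string.

    Hypothetical helper: it mirrors the ordering rule behind the error in
    the traceback (right must not be less than left, lower must not be
    less than upper).
    """
    left, upper, right, lower = box
    if right < left:
        return "Coordinate 'right' is less than 'left'"
    if lower < upper:
        return "Coordinate 'lower' is less than 'upper'"
    return None


def filter_valid_boxes(boxes, width, height):
    """Keep boxes that are correctly ordered and lie inside the page image."""
    kept = []
    for box in boxes:
        left, upper, right, lower = box
        inside = left >= 0 and upper >= 0 and right <= width and lower <= height
        if inside and check_crop_box(box) is None:
            kept.append(box)
    return kept


# [xmin, ymin, xmax, ymax] from the traceback, on the 3405x5000 page image.
bad_box = (2491, 88, 2361, 0)
print(check_crop_box(bad_box))

# The second box is an invented, well-formed example for comparison.
print(filter_valid_boxes([bad_box, (100, 200, 400, 260)], 3405, 5000))
```

The first call reports the same message as the traceback, and the second keeps only the well-formed box.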

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cuda

myhloli commented 1 week ago

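Judging from this part of the log:
0: 1888x1312 219 embeddings, 81 isolateds, 106.1ms
Speed: 18.1ms preprocess, 106.1ms inference, 42.8ms postprocess per image at shape (1, 3, 1888, 1312)
2024-10-18 09:32:48.691 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 300, mfr time: 132.87
PyTorch was most likely not installed correctly because of network instability during installation, so at runtime the MFD (formula detection) step produced a large number of spurious formula bboxes out of thin air.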

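I recommend installing on a machine with a fast, stable network connection and using a mirror source, to reduce the uncertainty in the installation process.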
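
The symptom described above can be spotted mechanically. The following is a minimal sketch that sums the `embeddings` and `isolateds` counts from the detector lines in a log and flags pages with an unusually high total. The function names and the threshold of 200 are assumptions chosen for illustration, not values from magic-pdf, and the regular expression assumes both counts appear on the same line, as they do in this issue.

```python
import re

# Matches detector output such as "219 embeddings, 81 isolateds".
DETECTION_LINE = re.compile(r"(\d+) embeddings, (\d+) isolateds")

# The two detector lines quoted in this issue.
SAMPLE_LOG = """\
0: 1888x1312 219 embeddings, 81 isolateds, 106.1ms
0: 1888x1312 290 embeddings, 10 isolateds, 105.5ms
"""


def formula_boxes_per_page(log_text):
    """Return the total number of detected formula boxes for each page."""
    return [int(emb) + int(iso) for emb, iso in DETECTION_LINE.findall(log_text)]


def suspicious_pages(log_text, threshold=200):
    """Return (page_index, count) pairs whose count exceeds the threshold.

    The threshold is an arbitrary illustrative value, not a magic-pdf setting.
    """
    counts = formula_boxes_per_page(log_text)
    return [(i, n) for i, n in enumerate(counts) if n > threshold]


print(formula_boxes_per_page(SAMPLE_LOG))
print(suspicious_pages(SAMPLE_LOG))
```

Both pages in this log total 300 boxes, which matches the `formula nums: 300` line.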

wertyac commented 1 week ago

Thanks, I'll reinstall and try again. Could this be related to the CUDA and cuDNN versions? I'm using CUDA 11.8.

myhloli commented 1 week ago

PyTorch installs cu12.1 via pip as a dependency; it does not use the system's CUDA 11.8.
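
To make the distinction concrete, here is a small sketch that extracts the CUDA version from the build strings in the environment info above (`2.3.1+cu121`, `0.18.1+cu121`) and from the `CUDA_HOME` path. The helper names are hypothetical, and the parsing assumes the `+cuXYZ` suffix and `cuda-X.Y` directory naming seen in this log. On a machine where PyTorch imports cleanly, `torch.__version__` and `torch.version.cuda` report the same information directly.

```python
def cuda_from_build_tag(version):
    """Return the CUDA version encoded in a string like '2.3.1+cu121', or None."""
    _, _, local = version.partition("+")
    digits = local[2:]
    if not local.startswith("cu") or not digits.isdigit() or len(digits) < 2:
        return None  # no CUDA tag, for example a CPU-only build
    return f"{digits[:-1]}.{digits[-1]}"


def cuda_from_home(cuda_home):
    """Return the version from a path like '/usr/local/cuda-11.8', or None."""
    _, sep, version = cuda_home.rpartition("cuda-")
    return version if sep else None


# Values copied from the environment info in this issue.
print("torch build CUDA:      ", cuda_from_build_tag("2.3.1+cu121"))
print("torchvision build CUDA:", cuda_from_build_tag("0.18.1+cu121"))
print("system CUDA_HOME:      ", cuda_from_home("/usr/local/cuda-11.8"))
```

Here the two wheels agree on 12.1 while the system toolkit is 11.8, which matches the comment above that the pip-installed build does not use the system toolkit.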

wertyac commented 1 week ago

Strange. I switched to another server (Ubuntu 22.04, CUDA 12.1, driver 550.120, Python 3.10) and used the same installation procedure, and there the error does not occur. Very strange.