opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports extraction from PDFs, web pages, and multi-format e-books.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
13.43k stars 1.01k forks

Error: pymupdf.mupdf.FzErrorFormat: code=7: cannot parse object (906 0 R). Only images and layout.pdf were generated in output/鹏辉储能工商储/auto; middle.json, model.json, origin.pdf, and spans.pdf were not generated, and the generated layout.pdf cannot be opened. #472

Closed pandaominggz closed 2 months ago

pandaominggz commented 2 months ago

Description of the bug

[08/22 01:29:24 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/coder/.cache/modelscope/hub/wanderkid/PDF-Extract-Kit/models/Layout/model_final.pth ... 2024-08-22 01:29:24.268 | INFO | magic_pdf.model.pdf_extract_kit:init:148 - DocAnalysis init done! 2024-08-22 01:29:24.268 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 21.628090858459473 2024-08-22 01:29:33.187 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.64

0: 1312x1888 (no detections), 90.1ms Speed: 10.9ms preprocess, 90.1ms inference, 0.5ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:33.797 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:34.144 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.34

0: 1312x1888 (no detections), 25.4ms Speed: 10.8ms preprocess, 25.4ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:34.182 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:34.504 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.32

0: 1312x1888 (no detections), 25.4ms Speed: 12.0ms preprocess, 25.4ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:34.543 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:34.880 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.33

0: 1312x1888 (no detections), 25.4ms Speed: 10.9ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:34.917 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:35.264 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.34

0: 1312x1888 (no detections), 25.4ms Speed: 9.9ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:35.301 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:35.638 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.33

0: 1312x1888 1 embedding, 25.4ms Speed: 9.9ms preprocess, 25.4ms inference, 1.2ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:35.842 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 1, mfr time: 0.15 2024-08-22 01:29:36.172 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.32

0: 1312x1888 2 embeddings, 25.4ms Speed: 9.4ms preprocess, 25.4ms inference, 0.8ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:36.341 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 2, mfr time: 0.12 2024-08-22 01:29:36.670 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.32

0: 1312x1888 (no detections), 25.5ms Speed: 9.4ms preprocess, 25.5ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:36.706 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:37.018 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.31

0: 1312x1888 (no detections), 25.4ms Speed: 9.7ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:37.054 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:37.399 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.34

0: 1312x1888 (no detections), 25.4ms Speed: 9.8ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:37.436 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:37.839 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.4

0: 1312x1888 (no detections), 25.4ms Speed: 9.2ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:37.875 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:38.208 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.33

0: 1312x1888 (no detections), 25.5ms Speed: 10.6ms preprocess, 25.5ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:38.245 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:38.605 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.35

0: 1312x1888 2 embeddings, 25.4ms Speed: 9.3ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:38.915 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 2, mfr time: 0.26 2024-08-22 01:29:39.335 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.41

0: 1312x1888 1 embedding, 25.4ms Speed: 10.6ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:39.501 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 1, mfr time: 0.12 2024-08-22 01:29:39.894 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.39

0: 1312x1888 2 embeddings, 25.5ms Speed: 10.3ms preprocess, 25.5ms inference, 0.9ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:40.096 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 2, mfr time: 0.15 2024-08-22 01:29:40.496 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.39

0: 1312x1888 1 embedding, 25.4ms Speed: 9.2ms preprocess, 25.4ms inference, 0.8ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:40.766 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 1, mfr time: 0.23 2024-08-22 01:29:41.128 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.35

0: 1312x1888 1 embedding, 25.4ms Speed: 9.2ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:41.395 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 1, mfr time: 0.22 2024-08-22 01:29:41.750 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.35

0: 1312x1888 1 embedding, 25.4ms Speed: 9.2ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:42.221 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 1, mfr time: 0.43 2024-08-22 01:29:42.566 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.34

0: 1312x1888 (no detections), 25.4ms Speed: 10.0ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:42.603 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:42.914 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.3

0: 1312x1888 (no detections), 25.5ms Speed: 9.5ms preprocess, 25.5ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:42.951 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:43.404 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.45

0: 1312x1888 (no detections), 25.5ms Speed: 10.9ms preprocess, 25.5ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:43.442 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:43.824 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.38

0: 1312x1888 (no detections), 25.4ms Speed: 9.2ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888) 2024-08-22 01:29:43.860 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0 2024-08-22 01:29:44.209 | INFO | magic_pdf.model.pdf_extract_kit:call:159 - layout detection cost: 0.34

0: 1312x1888 (no detections), 25.4ms Speed: 9.6ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:44.245 | INFO | magic_pdf.model.pdf_extract_kit:call:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:44.251 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:124 - doc analyze cost: 11.700689554214478
2024-08-22 01:29:48.464 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 0, last_page_cost_time: 0.0
2024-08-22 01:29:49.834 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 1, last_page_cost_time: 1.37
2024-08-22 01:29:50.281 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 2, last_page_cost_time: 0.45
2024-08-22 01:29:50.463 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 3, last_page_cost_time: 0.18
2024-08-22 01:29:50.863 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 4, last_page_cost_time: 0.4
2024-08-22 01:29:51.435 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 5, last_page_cost_time: 0.57
2024-08-22 01:29:51.907 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 6, last_page_cost_time: 0.47
2024-08-22 01:29:52.424 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 7, last_page_cost_time: 0.52
2024-08-22 01:29:52.625 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 8, last_page_cost_time: 0.2
2024-08-22 01:29:52.952 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 9, last_page_cost_time: 0.33
2024-08-22 01:29:54.011 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 10, last_page_cost_time: 1.06
2024-08-22 01:29:54.192 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 10, reason: too_many_layout_columns
2024-08-22 01:29:54.195 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 11, last_page_cost_time: 0.18
2024-08-22 01:29:54.733 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 11, reason: complicated_layout
2024-08-22 01:29:54.734 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 12, last_page_cost_time: 0.54
2024-08-22 01:29:55.089 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 12, reason: complicated_layout
2024-08-22 01:29:55.090 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 13, last_page_cost_time: 0.36
2024-08-22 01:29:55.458 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 13, reason: too_many_layout_columns
2024-08-22 01:29:55.460 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 14, last_page_cost_time: 0.37
2024-08-22 01:29:55.809 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 14, reason: complicated_layout
2024-08-22 01:29:55.809 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 14, reason: too_many_layout_columns
2024-08-22 01:29:55.810 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 15, last_page_cost_time: 0.35
2024-08-22 01:29:56.111 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 15, reason: complicated_layout
2024-08-22 01:29:56.112 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 16, last_page_cost_time: 0.3
2024-08-22 01:29:56.121 | WARNING | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功:{'bbox': [227, 736, 245, 746], 'score': 0.46, 'latex': '<\!2^{\circ}\mathrm{C},'}
2024-08-22 01:29:56.122 | WARNING | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功:{'bbox': [227, 736, 245, 746], 'score': 0.46, 'latex': '<\!2^{\circ}\mathrm{C},'}
2024-08-22 01:29:56.391 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 16, reason: complicated_layout
2024-08-22 01:29:56.392 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 17, last_page_cost_time: 0.28
2024-08-22 01:29:56.650 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 17, reason: complicated_layout
2024-08-22 01:29:56.651 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 18, last_page_cost_time: 0.26
2024-08-22 01:29:57.065 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 19, last_page_cost_time: 0.41
2024-08-22 01:29:58.475 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 20, last_page_cost_time: 1.41
2024-08-22 01:29:58.930 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 20, reason: too_many_layout_columns
2024-08-22 01:29:58.931 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 21, last_page_cost_time: 0.46
2024-08-22 01:29:59.452 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 21, reason: too_many_layout_columns
2024-08-22 01:29:59.453 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 22, last_page_cost_time: 0.52
2024-08-22 01:29:59.827 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:140 - 发现了列表,列表行数:[(0, 4)], [[0, 1, 2, 3, 4]]
2024-08-22 01:29:59.827 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第4行是列表
2024-08-22 01:30:00.329 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:140 - 发现了列表,列表行数:[(0, 3)], [[0, 1, 2, 3]]
2024-08-22 01:30:00.330 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:153 - 列表行的第0到第3行是列表
2024-08-22 01:30:00.331 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 2)], [[0, 1, 2]]
2024-08-22 01:30:00.331 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:153 - 列表行的第0到第2行是列表
2024-08-22 01:30:00.436 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:140 - 发现了列表,列表行数:[(0, 5)], [[0, 1, 2, 3, 4, 5]]
2024-08-22 01:30:00.437 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第5行是列表
2024-08-22 01:30:00.532 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:140 - 发现了列表,列表行数:[(0, 5)], [[0, 1, 2, 3, 4, 5]]
2024-08-22 01:30:00.532 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:153 - 列表行的第0到第5行是列表
2024-08-22 01:30:00.536 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 18)], [[0, 1, 2, 3, 4, 5]]
2024-08-22 01:30:00.536 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:153 - 列表行的第0到第18行是列表
2024-08-22 01:30:00.539 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:140 - 发现了列表,列表行数:[(4, 21)], [[4, 5]]
2024-08-22 01:30:00.539 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第4到第21行是列表
2024-08-22 01:30:00.542 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:140 - 发现了列表,列表行数:[(0, 21)], [[0, 1, 2, 3, 4]]
2024-08-22 01:30:00.542 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第21行是列表
MuPDF error: syntax error: expected object number

MuPDF error: format error: Repair failed already - not trying again

2024-08-22 01:30:02.203 | ERROR | magic_pdf.tools.cli:parse_doc:69 - code=7: cannot parse object (906 0 R)
Traceback (most recent call last):

  File "/home/coder/miniconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │ │ └
    │ └
    └ <module 'sys' (built-in)>
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
    │ │ │ └ {}
    │ │ └ ()
    │ └ <function BaseCommand.main at 0x7f10b95e67a0>
    └
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
    │ │ └ <click.core.Context object at 0x7f10b9a5fc10>
    │ └ <function Command.invoke at 0x7f10b95e7250>
    └
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    │ │ │ │ │ └ {'path': '鹏辉储能工商储.pdf', 'output_dir': '', 'method': 'auto'}
    │ │ │ │ └ <click.core.Context object at 0x7f10b9a5fc10>
    │ │ │ └ <function cli at 0x7f0f6a8bd240>
    │ │ └
    │ └ <function Context.invoke at 0x7f10b95e5fc0>
    └ <click.core.Context object at 0x7f10b9a5fc10>
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    │ └ {'path': '鹏辉储能工商储.pdf', 'output_dir': '', 'method': 'auto'}
    └ ()
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 75, in cli
    parse_doc(path)
    │ └ '鹏辉储能工商储.pdf'
    └ <function cli.<locals>.parse_doc at 0x7f10b986b6d0>

  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 60, in parse_doc
    do_parse(
    └ <function do_parse at 0x7f0f6a8bc790>
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 74, in do_parse
    draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir)
    │ │ │ └ 'output/鹏辉储能工商储/auto'
    │ │ └ b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n1580 0 obj\r<</Linearized 1/L 11927666/O 1582/E 1886576/N 23/T 11925922/H [ 534 610]>>\rendo...
    │ └ [{'preproc_blocks': [{'type': 'image', 'bbox': [0, 36, 1204, 825], 'blocks': [{'bbox': [0, 36, 1204, 825], 'type': 'image_bod...
    └ <function draw_layout_bbox at 0x7f10b4dc5630>
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/libs/draw_bbox.py", line 143, in draw_layout_bbox
    pdf_docs.save(f"{out_path}/layout.pdf")
    │ └ <function Document.save at 0x7f10b5f860e0>
    └ Document('', <memory, doc# 6>)
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/__init__.py", line 5452, in save
    mupdf.pdf_save_document(pdf, filename, opts)
    │ │ │ │ └ (do_incremental=0 do_pretty=0 do_ascii=0 do_compress=0 do_compress_images=0 do_compress_fonts=0 do_decompress=0 do_garbage=0 ...
    │ │ │ └ 'output/鹏辉储能工商储/auto/layout.pdf'
    │ │ └ <pymupdf.mupdf.PdfDocument; proxy of <Swig Object of type 'mupdf::PdfDocument *' at 0x7f0e7380b630> >
    │ └ <function pdf_save_document at 0x7f10b5ebfeb0>
    └ <module 'pymupdf.mupdf' from '/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/mupdf.py'>
  File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/mupdf.py", line 50692, in pdf_save_document
    return _mupdf.pdf_save_document(doc, filename, opts)
    │ │ │ │ └ (do_incremental=0 do_pretty=0 do_ascii=0 do_compress=0 do_compress_images=0 do_compress_fonts=0 do_decompress=0 do_garbage=0 ...
    │ │ │ └ 'output/鹏辉储能工商储/auto/layout.pdf'
    │ │ └ <pymupdf.mupdf.PdfDocument; proxy of <Swig Object of type 'mupdf::PdfDocument *' at 0x7f0e7380b630> >
    │ └
    └ <module 'pymupdf._mupdf' from '/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/_mupdf.so'>

pymupdf.mupdf.FzErrorFormat: code=7: cannot parse object (906 0 R)

How to reproduce the bug

magic-pdf -p 鹏辉储能工商储.pdf
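
After running the command, a small check like the one below can confirm which outputs were actually written. This is only a sketch: the directory and file names come from the issue title, matching by file-name suffix is an assumption (real output names may carry a document-name prefix), and `missing_outputs` is a hypothetical helper, not part of magic-pdf.

```python
from pathlib import Path

# Suffixes of the files named in the issue title. Matching by suffix is an
# assumption, since real output names may carry a document-name prefix.
EXPECTED_SUFFIXES = ["layout.pdf", "spans.pdf", "origin.pdf", "middle.json", "model.json"]


def missing_outputs(output_dir, expected=EXPECTED_SUFFIXES):
    """Return the expected suffixes that no regular file in output_dir ends with."""
    root = Path(output_dir)
    # A missing directory is treated as "nothing was written".
    names = [p.name for p in root.iterdir() if p.is_file()] if root.is_dir() else []
    return [suffix for suffix in expected if not any(n.endswith(suffix) for n in names)]


if __name__ == "__main__":
    # Directory reported in the issue; adjust to your own run.
    print(missing_outputs("output/鹏辉储能工商储/auto"))
```

The check only tests for presence, so a file that exists but cannot be opened (as reported for layout.pdf here) still counts as written.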

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

pandaominggz commented 2 months ago

鹏辉储能工商储.pdf

myhloli commented 2 months ago

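This is a problem PyMuPDF runs into when writing out the PDF with the drawn bounding boxes. For a file like this, you can run demo.py directly to extract the markdown; just skip the command-line step that draws the boxes.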
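
In the traceback above, the exception is raised from `draw_layout_bbox` while saving `layout.pdf`, which would explain why the files normally written afterwards are missing. One general way to keep a failing debug-drawing step from blocking the main output is to treat it as optional. The following is a minimal sketch of that pattern, not MinerU's code; `run_steps`, `write_markdown`, and `draw_layout` are hypothetical names used only for illustration.

```python
def run_steps(required, optional):
    """Run required steps first, then optional ones, collecting optional failures.

    Both arguments are lists of (name, callable) pairs. An exception in a
    required step propagates to the caller; an exception in an optional step
    is recorded and the remaining optional steps still run.
    """
    results, failures = {}, {}
    for name, step in required:
        results[name] = step()
    for name, step in optional:
        try:
            results[name] = step()
        except Exception as exc:  # debug output should not abort the run
            failures[name] = exc
    return results, failures


def write_markdown():
    # Stand-in for the main extraction output.
    return "markdown written"


def draw_layout():
    # Stand-in for a debug-PDF save that fails on a damaged input file.
    raise RuntimeError("cannot parse object (906 0 R)")


if __name__ == "__main__":
    results, failures = run_steps(
        required=[("markdown", write_markdown)],
        optional=[("layout_pdf", draw_layout)],
    )
    print(results)
    print({name: str(exc) for name, exc in failures.items()})
```

Running the required steps before the optional ones means a damaged PDF that only breaks the drawing step still yields the markdown.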

pandaominggz commented 2 months ago

Thank you, thank you.