A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
GNU Affero General Public License v3.0
13.43k stars, 1.01k forks
Error: pymupdf.mupdf.FzErrorFormat: code=7: cannot parse object (906 0 R). In output/鹏辉储能工商储/auto only images and layout.pdf were generated; middle.json, model.json, origin.pdf, and spans.pdf were not generated, and the generated layout.pdf cannot be opened. #472
Closed
pandaominggz closed 2 months ago
Description of the bug
[08/22 01:29:24 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/coder/.cache/modelscope/hub/wanderkid/PDF-Extract-Kit/models/Layout/model_final.pth ...
2024-08-22 01:29:24.268 | INFO | magic_pdf.model.pdf_extract_kit:__init__:148 - DocAnalysis init done!
2024-08-22 01:29:24.268 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 21.628090858459473
2024-08-22 01:29:33.187 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.64
0: 1312x1888 (no detections), 90.1ms
Speed: 10.9ms preprocess, 90.1ms inference, 0.5ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:33.797 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:34.144 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.34
0: 1312x1888 (no detections), 25.4ms
Speed: 10.8ms preprocess, 25.4ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:34.182 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:34.504 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.32
0: 1312x1888 (no detections), 25.4ms
Speed: 12.0ms preprocess, 25.4ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:34.543 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:34.880 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.33
0: 1312x1888 (no detections), 25.4ms
Speed: 10.9ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:34.917 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:35.264 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.34
0: 1312x1888 (no detections), 25.4ms
Speed: 9.9ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:35.301 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:35.638 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.33
0: 1312x1888 1 embedding, 25.4ms
Speed: 9.9ms preprocess, 25.4ms inference, 1.2ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:35.842 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 1, mfr time: 0.15
2024-08-22 01:29:36.172 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.32
0: 1312x1888 2 embeddings, 25.4ms
Speed: 9.4ms preprocess, 25.4ms inference, 0.8ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:36.341 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 2, mfr time: 0.12
2024-08-22 01:29:36.670 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.32
0: 1312x1888 (no detections), 25.5ms
Speed: 9.4ms preprocess, 25.5ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:36.706 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:37.018 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.31
0: 1312x1888 (no detections), 25.4ms
Speed: 9.7ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:37.054 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:37.399 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.34
0: 1312x1888 (no detections), 25.4ms
Speed: 9.8ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:37.436 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:37.839 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.4
0: 1312x1888 (no detections), 25.4ms
Speed: 9.2ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:37.875 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:38.208 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.33
0: 1312x1888 (no detections), 25.5ms
Speed: 10.6ms preprocess, 25.5ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:38.245 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:38.605 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.35
0: 1312x1888 2 embeddings, 25.4ms
Speed: 9.3ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:38.915 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 2, mfr time: 0.26
2024-08-22 01:29:39.335 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.41
0: 1312x1888 1 embedding, 25.4ms
Speed: 10.6ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:39.501 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 1, mfr time: 0.12
2024-08-22 01:29:39.894 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.39
0: 1312x1888 2 embeddings, 25.5ms
Speed: 10.3ms preprocess, 25.5ms inference, 0.9ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:40.096 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 2, mfr time: 0.15
2024-08-22 01:29:40.496 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.39
0: 1312x1888 1 embedding, 25.4ms
Speed: 9.2ms preprocess, 25.4ms inference, 0.8ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:40.766 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 1, mfr time: 0.23
2024-08-22 01:29:41.128 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.35
0: 1312x1888 1 embedding, 25.4ms
Speed: 9.2ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:41.395 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 1, mfr time: 0.22
2024-08-22 01:29:41.750 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.35
0: 1312x1888 1 embedding, 25.4ms
Speed: 9.2ms preprocess, 25.4ms inference, 0.7ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:42.221 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 1, mfr time: 0.43
2024-08-22 01:29:42.566 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.34
0: 1312x1888 (no detections), 25.4ms
Speed: 10.0ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:42.603 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:42.914 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.3
0: 1312x1888 (no detections), 25.5ms
Speed: 9.5ms preprocess, 25.5ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:42.951 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:43.404 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.45
0: 1312x1888 (no detections), 25.5ms
Speed: 10.9ms preprocess, 25.5ms inference, 0.4ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:43.442 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:43.824 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.38
0: 1312x1888 (no detections), 25.4ms
Speed: 9.2ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:43.860 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:44.209 | INFO | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 0.34
0: 1312x1888 (no detections), 25.4ms
Speed: 9.6ms preprocess, 25.4ms inference, 0.3ms postprocess per image at shape (1, 3, 1312, 1888)
2024-08-22 01:29:44.245 | INFO | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 0, mfr time: 0.0
2024-08-22 01:29:44.251 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:124 - doc analyze cost: 11.700689554214478
2024-08-22 01:29:48.464 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 0, last_page_cost_time: 0.0
2024-08-22 01:29:49.834 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 1, last_page_cost_time: 1.37
2024-08-22 01:29:50.281 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 2, last_page_cost_time: 0.45
2024-08-22 01:29:50.463 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 3, last_page_cost_time: 0.18
2024-08-22 01:29:50.863 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 4, last_page_cost_time: 0.4
2024-08-22 01:29:51.435 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 5, last_page_cost_time: 0.57
2024-08-22 01:29:51.907 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 6, last_page_cost_time: 0.47
2024-08-22 01:29:52.424 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 7, last_page_cost_time: 0.52
2024-08-22 01:29:52.625 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 8, last_page_cost_time: 0.2
2024-08-22 01:29:52.952 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 9, last_page_cost_time: 0.33
2024-08-22 01:29:54.011 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 10, last_page_cost_time: 1.06
2024-08-22 01:29:54.192 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 10, reason: too_many_layout_columns
2024-08-22 01:29:54.195 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 11, last_page_cost_time: 0.18
2024-08-22 01:29:54.733 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 11, reason: complicated_layout
2024-08-22 01:29:54.734 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 12, last_page_cost_time: 0.54
2024-08-22 01:29:55.089 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 12, reason: complicated_layout
2024-08-22 01:29:55.090 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 13, last_page_cost_time: 0.36
2024-08-22 01:29:55.458 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 13, reason: too_many_layout_columns
2024-08-22 01:29:55.460 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 14, last_page_cost_time: 0.37
2024-08-22 01:29:55.809 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 14, reason: complicated_layout
2024-08-22 01:29:55.809 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 14, reason: too_many_layout_columns
2024-08-22 01:29:55.810 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 15, last_page_cost_time: 0.35
2024-08-22 01:29:56.111 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 15, reason: complicated_layout
2024-08-22 01:29:56.112 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 16, last_page_cost_time: 0.3
2024-08-22 01:29:56.121 | WARNING | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功:{'bbox': [227, 736, 245, 746], 'score': 0.46, 'latex': '<\!2^{\circ}\mathrm{C},'}
2024-08-22 01:29:56.122 | WARNING | magic_pdf.pre_proc.equations_replace:replace_inline_equations:453 - 行内公式没有替换成功:{'bbox': [227, 736, 245, 746], 'score': 0.46, 'latex': '<\!2^{\circ}\mathrm{C},'}
2024-08-22 01:29:56.391 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 16, reason: complicated_layout
2024-08-22 01:29:56.392 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 17, last_page_cost_time: 0.28
2024-08-22 01:29:56.650 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:162 - skip this page, page_id: 17, reason: complicated_layout
2024-08-22 01:29:56.651 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 18, last_page_cost_time: 0.26
2024-08-22 01:29:57.065 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 19, last_page_cost_time: 0.41
2024-08-22 01:29:58.475 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 20, last_page_cost_time: 1.41
2024-08-22 01:29:58.930 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 20, reason: too_many_layout_columns
2024-08-22 01:29:58.931 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 21, last_page_cost_time: 0.46
2024-08-22 01:29:59.452 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:169 - skip this page, page_id: 21, reason: too_many_layout_columns
2024-08-22 01:29:59.453 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 22, last_page_cost_time: 0.52
2024-08-22 01:29:59.827 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 4)], [[0, 1, 2, 3, 4]]
2024-08-22 01:29:59.827 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第4行是列表
2024-08-22 01:30:00.329 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 3)], [[0, 1, 2, 3]]
2024-08-22 01:30:00.330 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第3行是列表
2024-08-22 01:30:00.331 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 2)], [[0, 1, 2]]
2024-08-22 01:30:00.331 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第2行是列表
2024-08-22 01:30:00.436 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 5)], [[0, 1, 2, 3, 4, 5]]
2024-08-22 01:30:00.437 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第5行是列表
2024-08-22 01:30:00.532 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 5)], [[0, 1, 2, 3, 4, 5]]
2024-08-22 01:30:00.532 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第5行是列表
2024-08-22 01:30:00.536 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 18)], [[0, 1, 2, 3, 4, 5]]
2024-08-22 01:30:00.536 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第18行是列表
2024-08-22 01:30:00.539 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(4, 21)], [[4, 5]]
2024-08-22 01:30:00.539 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第4到第21行是列表
2024-08-22 01:30:00.542 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表,列表行数:[(0, 21)], [[0, 1, 2, 3, 4]]
2024-08-22 01:30:00.542 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第0到第21行是列表
MuPDF error: syntax error: expected object number
MuPDF error: format error: Repair failed already - not trying again
2024-08-22 01:30:02.203 | ERROR | magic_pdf.tools.cli:parse_doc:69 - code=7: cannot parse object (906 0 R)
Traceback (most recent call last):
File "/home/coder/miniconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
sys.exit(cli())
│ │ └
│ └
└ <module 'sys' (built-in)>
File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <function BaseCommand.main at 0x7f10b95e67a0>
└
File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
│ │ └ <click.core.Context object at 0x7f10b9a5fc10>
│ └ <function Command.invoke at 0x7f10b95e7250>
└
File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, ctx.params)
│ │ │ │ │ └ {'path': '鹏辉储能工商储.pdf', 'output_dir': '', 'method': 'auto'}
│ │ │ │ └ <click.core.Context object at 0x7f10b9a5fc10>
│ │ │ └ <function cli at 0x7f0f6a8bd240>
│ │ └
│ └ <function Context.invoke at 0x7f10b95e5fc0>
└ <click.core.Context object at 0x7f10b9a5fc10>
File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
│ └ {'path': '鹏辉储能工商储.pdf', 'output_dir': '', 'method': 'auto'}
└ ()
File "/home/coder/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 75, in cli
parse_doc(path)
│ └ '鹏辉储能工商储.pdf'
└ <function cli.<locals>.parse_doc at 0x7f10b986b6d0>
pymupdf.mupdf.FzErrorFormat: code=7: cannot parse object (906 0 R)
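The error above names indirect object 906 (generation 0). In PDF syntax an indirect object is defined as `906 0 obj ... endobj` and referenced as `906 0 R`, so one rough way to check whether the input file itself is damaged is to look for that definition in the raw bytes. The following is only a sketch using the Python standard library, not part of magic-pdf or PyMuPDF; the helper names `find_object_definitions` and `count_references` are hypothetical. It assumes the object is stored uncompressed: an object kept inside a compressed object stream will not be found by this scan even in a healthy file.

```python
import re


def find_object_definitions(pdf_bytes: bytes, obj_num: int, gen: int = 0) -> list[int]:
    """Return byte offsets where '<obj_num> <gen> obj' appears in the raw PDF bytes."""
    # The lookbehind stops '1906 0 obj' from matching a search for object 906.
    pattern = re.compile(rb"(?<!\d)%d\s+%d\s+obj\b" % (obj_num, gen))
    return [m.start() for m in pattern.finditer(pdf_bytes)]


def count_references(pdf_bytes: bytes, obj_num: int, gen: int = 0) -> int:
    """Count occurrences of the indirect reference '<obj_num> <gen> R'."""
    pattern = re.compile(rb"(?<!\d)%d\s+%d\s+R\b" % (obj_num, gen))
    return len(pattern.findall(pdf_bytes))


# Tiny in-memory sample: object 906 is referenced but never defined.
sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 906 0 R >>\nendobj\n"
print("definitions of 906:", find_object_definitions(sample, 906))
print("references to 906:", count_references(sample, 906))
```

To try it on the real file, replace `sample` with the bytes read from the PDF in binary mode. References without any definition would be consistent with a damaged or truncated file, but they do not prove it.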
How to reproduce the bug
magic-pdf -p 鹏辉储能工商储.pdf
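After running the command, the symptom in the title (only images and layout.pdf present) can be confirmed with a small script. This is a minimal sketch, assuming the output files end with the names listed in the title, possibly with a document-name prefix; `missing_outputs` is a hypothetical helper, not a magic-pdf API. The demo uses a temporary directory that mimics the reported state rather than the real output folder.

```python
import tempfile
from pathlib import Path

# File-name endings taken from the issue title; the exact naming is an assumption.
EXPECTED_SUFFIXES = ("middle.json", "model.json", "origin.pdf", "spans.pdf", "layout.pdf")


def missing_outputs(output_dir: str, expected=EXPECTED_SUFFIXES) -> list[str]:
    """Return the expected endings for which no file exists in output_dir."""
    names = [p.name for p in Path(output_dir).iterdir() if p.is_file()]
    return [s for s in expected if not any(n.endswith(s) for n in names)]


# Demo: a directory holding only an images folder and layout.pdf, as reported.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "images").mkdir()
    (Path(tmp) / "layout.pdf").write_bytes(b"")
    print(missing_outputs(tmp))
```

For the real run, call `missing_outputs` on the `auto` directory named in the title.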
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.7.x
Device mode
cuda