opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
18.18k stars 1.3k forks

Please help me look at this problem: a PDF file that ran fine on the original version 0.8.1 fails after I switched to the new framework #1022

Closed farierer closed 4 days ago

farierer commented 4 days ago

Description of the bug

2024-11-19 11:00:29.883 | ERROR | magic_pdf.user_api:parse_pdf:97 - The expanded size of the tensor (567) must match the existing size (514) at non-singleton dimension 1. Target sizes: [1, 567]. Tensor sizes: [1, 514]

Traceback (most recent call last):

File "/root/MinerU/run.py", line 193, in <module> pdf_parse_main(pdf_path) │ └ '/root/PDF/error/02B20231201C_l.pdf' └ <function pdf_parse_main at 0x7f55d36d48b0>

File "/root/MinerU/run.py", line 137, in pdf_parse_main pipe.pipe_parse() │ └ <function UNIPipe.pipe_parse at 0x7f57681540d0> └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f563e17a320>

File "/root/MinerU/magic_pdf/pipe/UNIPipe.py", line 44, in pipe_parse self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer, │ │ │ │ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7f584a677970> │ │ │ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f563e17a320> │ │ │ │ │ │ └ [{'layout_dets': [{'category_id': 1, 'poly': [22.12286376953125, 2711.37548828125, 429.85791015625, 2711.37548828125, 429.857... │ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f563e17a320> │ │ │ │ └ b'%PDF-1.4\r%\xe2\xe3\xcf\xd3\r\n1 0 obj\r\n<<\r\n/ModDate (D:20231201024525+08\'00\')\r\n/CreationDate (D:20231201024525+08... │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f563e17a320> │ │ └ <function parse_union_pdf at 0x7f576813be20> │ └ None └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f563e17a320>

File "/root/MinerU/magic_pdf/user_api.py", line 100, in parse_union_pdf pdf_info_dict = parse_pdf(parse_pdf_by_txt) │ └ <function parse_pdf_by_txt at 0x7f576813bd00> └ <function parse_union_pdf.<locals>.parse_pdf at 0x7f55c4722dd0>

File "/root/MinerU/magic_pdf/user_api.py", line 88, in parse_pdf return method( └ <function parse_pdf_by_txt at 0x7f576813bd00>

File "/root/MinerU/magic_pdf/pdf_parse_by_txt.py", line 15, in parse_pdf_by_txt return pdf_parse_union(dataset, │ └ <magic_pdf.data.dataset.PymuDocDataset object at 0x7f55bd186170> └ <function pdf_parse_union at 0x7f576813bc70>

File "/root/MinerU/magic_pdf/pdf_parse_union_core_v2.py", line 617, in pdf_parse_union page_info = parse_page_core( └ <function parse_page_core at 0x7f576813bbe0>

File "/root/MinerU/magic_pdf/pdf_parse_union_core_v2.py", line 542, in parse_page_core sorted_bboxes = sort_lines_by_model(fix_blocks, page_w, page_h, line_height) │ │ │ │ └ 9 │ │ │ └ 1433.249755859375 │ │ └ 1026.0 │ └ [{'type': 'text', 'bbox': [7, 976, 154, 1111], 'lines': [{'bbox': [27.08985710144043, 976.544677734375, 152.90780639648438, 9... └ <function sort_lines_by_model at 0x7f576813b880>

File "/root/MinerU/magic_pdf/pdf_parse_union_core_v2.py", line 305, in sort_lines_by_model orders = do_predict(boxes, model) │ │ └ LayoutLMv3ForTokenClassification( │ │ (layoutlmv3): LayoutLMv3Model( │ │ (embeddings): LayoutLMv3TextEmbeddings( │ │ (word_em... │ └ [[26, 681, 149, 688], [9, 689, 149, 696], [9, 697, 149, 704], [9, 705, 149, 712], [9, 713, 149, 719], [9, 721, 149, 727], [9,... └ <function do_predict at 0x7f576813b5b0>

File "/root/MinerU/magic_pdf/pdf_parse_union_core_v2.py", line 172, in do_predict logits = model(**inputs).logits.cpu().squeeze(0) │ └ └ LayoutLMv3ForTokenClassification( (layoutlmv3): LayoutLMv3Model( (embeddings): LayoutLMv3TextEmbeddings( (word_em...

File "/root/anaconda3/envs/mineru/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) │ │ │ └ │ │ └ () │ └ <function Module._call_impl at 0x7f576e03ac20> └ LayoutLMv3ForTokenClassification( (layoutlmv3): LayoutLMv3Model( (embeddings): LayoutLMv3TextEmbeddings( (word_em...

File "/root/anaconda3/envs/mineru/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) │ │ └ │ └ () └ <bound method LayoutLMv3ForTokenClassification.forward of LayoutLMv3ForTokenClassification( (layoutlmv3): LayoutLMv3Model( ...

File "/root/anaconda3/envs/mineru/lib/python3.10/site-packages/transformers/models/layoutlmv3/modeling_layoutlmv3.py", line 1099, in forward outputs = self.layoutlmv3( └ LayoutLMv3ForTokenClassification( (layoutlmv3): LayoutLMv3Model( (embeddings): LayoutLMv3TextEmbeddings( (word_em...

File "/root/anaconda3/envs/mineru/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) │ │ │ └ │ │ └ │ └ <function Module._call_impl at 0x7f576e03ac20> └ LayoutLMv3Model( (embeddings): LayoutLMv3TextEmbeddings( (word_embeddings): Embedding(50265, 1024, padding_idx=1) (...

File "/root/anaconda3/envs/mineru/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) │ │ └ │ └ └ <bound method LayoutLMv3Model.forward of LayoutLMv3Model( (embeddings): LayoutLMv3TextEmbeddings( (word_embeddings): Em...

File "/root/anaconda3/envs/mineru/lib/python3.10/site-packages/transformers/models/layoutlmv3/modeling_layoutlmv3.py", line 961, in forward position_ids = position_ids.expand_as(input_ids) │ │ └ │ └ <method 'expand_as' of 'torch._C.TensorBase' objects> └

The full error is shown above. Not every PDF triggers it, and for now I can't say for certain where the problem lies. If you have time, please take a look.
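For readers trying to interpret the message: the last frame calls `position_ids.expand_as(input_ids)`, and the log reports a tensor of shape [1, 514] being expanded to [1, 567]. An expand can only stretch dimensions of size 1, so 514 cannot become 567. The sketch below is a simplified plain-Python model of that shape rule, not PyTorch's implementation; the helper name `can_expand` is hypothetical. That the page simply produced more entries than the model's position buffer holds is an inference from the message, not something the issue confirms.

```python
def can_expand(src_shape, target_shape):
    """Simplified model of the expand rule (hypothetical helper, not a torch API).

    Shapes are aligned from the right. Each source dimension must either
    equal the target dimension or be 1 (a singleton that can be stretched).
    """
    if len(src_shape) > len(target_shape):
        return False
    offset = len(target_shape) - len(src_shape)
    for i, src_dim in enumerate(src_shape):
        target_dim = target_shape[offset + i]
        if src_dim != target_dim and src_dim != 1:
            return False
    return True


# Shapes taken from the error message in the log above.
position_ids_shape = (1, 514)
input_ids_shape = (1, 567)
print(can_expand(position_ids_shape, input_ids_shape))
```

With the shapes from the log the check fails on dimension 1, which matches the "non-singleton dimension 1" wording of the error.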

How to reproduce the bug

To reproduce, simply run the magic_pdf_parse_main file.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

myhloli commented 4 days ago

This bug was fixed in 0.9.3. Updating will resolve it.
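A quick way to check whether an environment already has the fix is to compare the installed version against 0.9.3. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the distribution is installed under the name `magic-pdf`, as the `magic-pdf --version` command in the issue template suggests. The version parsing is deliberately naive and only reads leading numeric components.

```python
import re
from importlib import metadata

FIXED_IN = (0, 9, 3)  # version named in the maintainer's reply


def parse_version(text):
    """Turn '0.9.3' into (0, 9, 3); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_fix(version_text, fixed_in=FIXED_IN):
    """True if version_text is at least the release that contains the fix."""
    return parse_version(version_text) >= fixed_in


def installed_has_fix(package="magic-pdf"):
    """Check the installed package (name is an assumption); False if absent."""
    try:
        return has_fix(metadata.version(package))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("fix present:", installed_has_fix())
```

If this reports that the fix is missing, upgrading the package to 0.9.3 or later is the remedy described in the reply above.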

farierer commented 4 days ago

Thank you.