opendatalab / MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
18.9k stars 1.35k forks

Occasionally an error occurs saying an image in the PDF cannot be found, and then the program exits #674

Open WXpiero opened 2 months ago

WXpiero commented 2 months ago

Description of the bug

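My machine is weak, with only 40 GB of RAM, and I was worried about running out of memory partway through parsing. So when parsing PDFs of more than 5,000 pages, I first split the PDF into smaller files of 80 pages each and then parse those with magic-pdf. Among the many resulting files, the console output occasionally shows a "file not found" error for an image, as in the log below. Whenever this error occurs, no layout or Markdown file is written for that PDF. I don't know whether this is caused by my splitting the PDF. I split one book into three batches of small PDFs, and for one of those batches none of the small PDFs produced any output.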
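A minimal sketch of the splitting step described above, assuming fixed 80-page chunks. It only computes the page ranges for each chunk; the helper name `chunk_page_ranges` is hypothetical, and the report does not say which tool was actually used to write the split files.

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 80) -> list[tuple[int, int]]:
    """Return 0-based, inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        # The last chunk may be shorter than chunk_size.
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


# Example: a 5,000-page book split into 80-page pieces.
ranges = chunk_page_ranges(5000, 80)
print(len(ranges), ranges[0], ranges[-1])
```

Each range can then be handed to whatever PDF library performs the actual split. Note that split files whose names repeat the full book title produce long output paths.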

How to reproduce the bug

0: 1888x1472 (no detections), 5348.0ms
Speed: 31.3ms preprocess, 5348.0ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-29 06:12:10.477 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 0, mfr time: 0.0
2024-09-29 06:12:44.351 | INFO | magic_pdf.model.pdf_extract_kit:__call__:372 - ocr cost: 33.87
2024-09-29 06:13:05.565 | INFO | magic_pdf.model.pdf_extract_kit:__call__:259 - layout detection cost: 21.21

0: 1888x1472 (no detections), 5281.3ms
Speed: 47.3ms preprocess, 5281.3ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-29 06:13:10.893 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 0, mfr time: 0.0
2024-09-29 06:13:28.802 | INFO | magic_pdf.model.pdf_extract_kit:__call__:372 - ocr cost: 17.91
2024-09-29 06:13:49.682 | INFO | magic_pdf.model.pdf_extract_kit:__call__:259 - layout detection cost: 20.88

0: 1888x1472 1 embedding, 5265.1ms
Speed: 31.3ms preprocess, 5265.1ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-29 06:13:57.233 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 1, mfr time: 2.24
2024-09-29 06:14:37.120 | INFO | magic_pdf.model.pdf_extract_kit:__call__:372 - ocr cost: 39.87
2024-09-29 06:14:37.120 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:136 - doc analyze cost: 4823.803959131241
2024-09-29 06:14:37.827 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 0, last_page_cost_time: 0.0
2024-09-29 06:14:37.859 | ERROR | magic_pdf.user_api:parse_pdf:91 - [Errno 2] No such file or directory: 'C:\download\pdf\11\222\output\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2_1total\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2_split_9\auto\images\e6a42f9b1b5b49c9f8f6810f6a1f8e562f0c407c3c2b88e94166e9c1839b83b8.jpg'
Traceback (most recent call last):

  File "C:\Users\wxpie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': None, '__package__': '', '__loader__': <zipimporter object "c:\ai\pdf_mark\venv\Scripts\m...
           │         └ <code object <module> at 0x0000016A15988240, file "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe\__main__.py", line 1>
           └ <function _run_code at 0x0000016A1595E560>

  File "C:\Users\wxpie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': None, '__package__': '', '__loader__': <zipimporter object "c:\ai\pdf_mark\venv\Scripts\m...
         └ <code object <module> at 0x0000016A15988240, file "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe\__main__.py", line 1>

  File "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
    │   │    └
    │   └
    └ <module 'sys' (built-in)>

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x0000016A15DE9E10>
           └

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x0000016A159C4EB0>
         │    └ <function Command.invoke at 0x0000016A15DEA8C0>
         └

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': 'C:\download\pdf\11\222\test\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2', 'outp...
           │   │      │    │           └ <click.core.Context object at 0x0000016A159C4EB0>
           │   │      │    └ <function cli at 0x0000016A5C782200>
           │   │      └
           │   └ <function Context.invoke at 0x0000016A15DE9630>
           └ <click.core.Context object at 0x0000016A159C4EB0>

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': 'C:\download\pdf\11\222\test\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2', 'outp...
                       └ ()

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\cli.py", line 100, in cli
    parse_doc(doc_path)
    │         └ WindowsPath('C:/download/pdf/11/222/test/Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit2/Braunwald...
    └ <function cli.<locals>.parse_doc at 0x0000016A159CCB80>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\cli.py", line 84, in parse_doc
    do_parse(
    └ <function do_parse at 0x0000016A5C781990>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\common.py", line 85, in do_parse
    pipe.pipe_parse()
    │    └ <function UNIPipe.pipe_parse at 0x0000016A5C781BD0>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 38, in pipe_parse
    self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
    │    │              │               │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>
    │    │              │               │    │          │    │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
    │    │              │               │    │          │    └ [{'layout_dets': [{'category_id': 0, 'poly': [10.797889709472656, 1376.1629638671875, 579.42041015625, 1376.1629638671875, 57...
    │    │              │               │    │          └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
    │    │              │               │    └ b'%PDF-1.7\n%\xc2\xb5\xc2\xb6\n\n1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n\n2 0 obj\n<</Type/Pages/Count 80/Kids[37 0 ...
    │    │              │               └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
    │    │              └ <function parse_union_pdf at 0x0000016A5C7811B0>
    │    └ None
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\user_api.py", line 101, in parse_union_pdf
    pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
                    │         └ <function parse_pdf_by_ocr at 0x0000016A3A3B9D80>
                    └ <function parse_union_pdf.<locals>.parse_pdf at 0x0000016A9EC7BC70>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\user_api.py", line 82, in parse_pdf
    return method(
           └ <function parse_pdf_by_ocr at 0x0000016A3A3B9D80>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pdf_parse_by_ocr.py", line 11, in parse_pdf_by_ocr
    return pdf_parse_union(pdf_bytes,
           │               └ b'%PDF-1.7\n%\xc2\xb5\xc2\xb6\n\n1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n\n2 0 obj\n<</Type/Pages/Count 80/Kids[37 0 ...
           └ <function pdf_parse_union at 0x0000016A5C780EE0>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pdf_parse_union_core.py", line 249, in pdf_parse_union
    page_info = parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter, parse_mode)
                │               │         │            │        │              │            └ 'ocr'
                │               │         │            │        │              └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>
                │               │         │            │        └ 'B7BDFC834A92FCBCFEB8ACF7B733A395'
                │               │         │            └ 0
                │               │         └ <magic_pdf.model.magic_model.MagicModel object at 0x0000016A791465F0>
                │               └ Document('', <memory, doc# 49>)
                └ <function parse_page_core at 0x0000016A5C780E50>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pdf_parse_union_core.py", line 128, in parse_page_core
    spans = ocr_cut_image_and_table(spans, pdf_docs[page_id], page_id, pdf_bytes_md5, imageWriter)
            │                       │      │        │         │        │              └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>
            │                       │      │        │         │        └ 'B7BDFC834A92FCBCFEB8ACF7B733A395'
            │                       │      │        │         └ 0
            │                       │      │        └ 0
            │                       │      └ Document('', <memory, doc# 49>)
            │                       └ [{'bbox': [3, 64, 596, 133], 'score': 0.993672788143158, 'type': 'table'}, {'bbox': [298, 538, 326, 553], 'score': 0.87, 'con...
            └ <function ocr_cut_image_and_table at 0x0000016A5C766290>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pre_proc\cut_image.py", line 22, in ocr_cut_image_and_table
    span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('tables'),
    │                    │         │             │        │                 └ <function ocr_cut_image_and_table.<locals>.return_path at 0x0000016AA39440D0>
    │                    │         │             │        └ page 0 of <memory, doc# 49>
    │                    │         │             └ 0
    │                    │         └ {'bbox': [3, 64, 596, 133], 'score': 0.993672788143158, 'type': 'table'}
    │                    └ <function cut_image at 0x0000016A5C766440>
    └ {'bbox': [3, 64, 596, 133], 'score': 0.993672788143158, 'type': 'table'}

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\libs\pdf_image_tools.py", line 31, in cut_image
    imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
    │           │     │          │                 │               └ 'binary'
    │           │     │          │                 └ <class 'magic_pdf.rw.AbsReaderWriter.AbsReaderWriter'>
    │           │     │          └ 'e6a42f9b1b5b49c9f8f6810f6a1f8e562f0c407c3c2b88e94166e9c1839b83b8.jpg'
    │           │     └ b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00\x00\x00\x00\xff\xdb\x00C\x00\x02\x01\x01\x01\x01\x01\x02\x01\x01\x01\x02...
    │           └ <function DiskReaderWriter.write at 0x0000016A17D7D120>
    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\rw\DiskReaderWriter.py", line 41, in write
    with open(abspath, "wb") as f:
              └ 'C:\download\pdf\11\222\output\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2_1total\Braunw...

FileNotFoundError: [Errno 2] No such file or directory: 'C:\download\pdf\11\222\output\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2_1total\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2_split_9\auto\images\e6a42f9b1b5b49c9f8f6810f6a1f8e562f0c407c3c2b88e94166e9c1839b83b8.jpg'
2024-09-29 06:14:37.923 | ERROR | magic_pdf.tools.cli:parse_doc:96 - Both parse_pdf_by_txt and parse_pdf_by_ocr failed.
Traceback (most recent call last):

  File "C:\Users\wxpie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': None, '__package__': '', '__loader__': <zipimporter object "c:\ai\pdf_mark\venv\Scripts\m...
           │         └ <code object <module> at 0x0000016A15988240, file "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe\__main__.py", line 1>
           └ <function _run_code at 0x0000016A1595E560>

  File "C:\Users\wxpie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': None, '__package__': '', '__loader__': <zipimporter object "c:\ai\pdf_mark\venv\Scripts\m...
         └ <code object <module> at 0x0000016A15988240, file "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe\__main__.py", line 1>

  File "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
    │   │    └
    │   └
    └ <module 'sys' (built-in)>

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x0000016A15DE9E10>
           └

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x0000016A159C4EB0>
         │    └ <function Command.invoke at 0x0000016A15DEA8C0>
         └

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': 'C:\download\pdf\11\222\test\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2', 'outp...
           │   │      │    │           └ <click.core.Context object at 0x0000016A159C4EB0>
           │   │      │    └ <function cli at 0x0000016A5C782200>
           │   │      └
           │   └ <function Context.invoke at 0x0000016A15DE9630>
           └ <click.core.Context object at 0x0000016A159C4EB0>

  File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': 'C:\download\pdf\11\222\test\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2', 'outp...
                       └ ()

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\cli.py", line 100, in cli
    parse_doc(doc_path)
    │         └ WindowsPath('C:/download/pdf/11/222/test/Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit2/Braunwald...
    └ <function cli.<locals>.parse_doc at 0x0000016A159CCB80>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\cli.py", line 84, in parse_doc
    do_parse(
    └ <function do_parse at 0x0000016A5C781990>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\common.py", line 85, in do_parse
    pipe.pipe_parse()
    │    └ <function UNIPipe.pipe_parse at 0x0000016A5C781BD0>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 38, in pipe_parse
    self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
    │    │              │               │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>
    │    │              │               │    │          │    │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
    │    │              │               │    │          │    └ [{'layout_dets': [{'category_id': 0, 'poly': [10.797889709472656, 1376.1629638671875, 579.42041015625, 1376.1629638671875, 57...
    │    │              │               │    │          └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
    │    │              │               │    └ b'%PDF-1.7\n%\xc2\xb5\xc2\xb6\n\n1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n\n2 0 obj\n<</Type/Pages/Count 80/Kids[37 0 ...
    │    │              │               └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
    │    │              └ <function parse_union_pdf at 0x0000016A5C7811B0>
    │    └ None
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>

  File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\user_api.py", line 103, in parse_union_pdf
    raise Exception("Both parse_pdf_by_txt and parse_pdf_by_ocr failed.")

Exception: Both parse_pdf_by_txt and parse_pdf_by_ocr failed.

(venv) C:\ai>

Operating system

Windows

Python version

3.10.11

Software version (magic-pdf --version)

0.8.x

Device mode

cpu

myhloli commented 2 months ago

This may be because Windows does not support very long paths. You could try making the path shorter.
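A minimal sketch for checking this hypothesis, assuming the classic Windows `MAX_PATH` limit of 260 characters (including the terminating NUL) is in effect and long-path support is not enabled. The helper name `exceeds_max_path` is hypothetical and is not part of magic-pdf.

```python
from pathlib import PureWindowsPath

# Classic Windows MAX_PATH: 260 characters including the terminating NUL,
# so a path of 260 or more characters is too long.
# Assumption: long-path support is not enabled on the machine.
MAX_PATH = 260


def exceeds_max_path(path: str, limit: int = MAX_PATH) -> bool:
    """Return True if the path is too long for the classic Windows limit."""
    return len(str(PureWindowsPath(path))) >= limit


# Path copied from the error message in the report.
failing = (
    r"C:\download\pdf\11\222\output"
    r"\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2_1total"
    r"\Braunwald_s_Heart_DiseaseA_Textbook_of_Cardiovascular_Medicinesplit_2_split_9"
    r"\auto\images"
    r"\e6a42f9b1b5b49c9f8f6810f6a1f8e562f0c407c3c2b88e94166e9c1839b83b8.jpg"
)
print(len(failing), exceeds_max_path(failing))
```

If the check reports that the path is too long, using shorter input file names or a shorter output directory should bring the image path back under the limit, since the 64-character image hash and the `auto\images` suffix are appended to whatever output path is chosen.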