opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.19k stars 835 forks

Unable to allocate 41.9 MiB for an array with shape (6, 276, 6625) and data type float32 #484

Open laulguo opened 2 weeks ago

laulguo commented 2 weeks ago

Description of the bug

Unable to allocate 41.9 MiB for an array with shape (6, 276, 6625) and data type float32

How to reproduce the bug

I hit this bug while processing a 316-page PDF. Could the file be too large?

2024-08-25 23:55:25.082 | ERROR | magic_pdf.tools.cli:parse_doc:69 - Unable to allocate 41.9 MiB for an array with shape (6, 276, 6625) and data type float32
Traceback (most recent call last):

  File "D:\anaconda3\envs\MinerU\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\anaconda3\envs\MinerU\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\anaconda3\envs\MinerU\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                              └ {'path': 'shang.pdf', 'output_dir': '', 'method': 'auto'}
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\tools\cli.py", line 75, in cli
    parse_doc(path)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\tools\cli.py", line 60, in parse_doc
    do_parse(
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\tools\common.py", line 65, in do_parse
    pipe.pipe_analyze()
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 31, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=True)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 120, in doc_analyze
    result = custom_model(img)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\pdf_extract_kit.py", line 250, in __call__
    ocr_res = self.ocr_model.ocr(new_image, mfd_res=adjusted_mfdetrec_res)[0]
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\pek_sub_modules\self_modify.py", line 209, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\pek_sub_modules\self_modify.py", line 289, in __call__
    rec_res, elapse = self.text_recognizer(img_crop_list)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\paddleocr\tools\infer\predict_rec.py", line 619, in __call__
    output = output_tensor.copy_to_cpu()

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 41.9 MiB for an array with shape (6, 276, 6625) and data type float32
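The size in the error message follows directly from the array's shape and element type. The short check below is only an illustration of that arithmetic (it is not part of MinerU or PaddleOCR). It shows that the failed request was small, which suggests that memory was already nearly exhausted when the allocation was attempted.

```python
import math

# Shape and dtype taken from the error message above.
shape = (6, 276, 6625)
itemsize = 4  # a float32 element occupies 4 bytes

n_bytes = math.prod(shape) * itemsize  # total bytes requested
size_mib = n_bytes / 2**20             # 1 MiB = 2**20 bytes

print(f"{n_bytes} bytes = {size_mib:.1f} MiB")
```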

(MinerU) D:\MinerU>

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

myhloli commented 2 weeks ago

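This looks like an out-of-memory error. You can split a PDF with too many pages into several smaller files and process them separately, or add more RAM to the machine.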
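The splitting suggestion could be sketched as follows. This is a minimal example, not something shipped with MinerU: it assumes PyMuPDF (`fitz`) is installed, and the helper names `page_ranges` and `split_pdf`, the output file naming, and the default of 50 pages per part are all hypothetical choices made for illustration.

```python
from pathlib import Path


def page_ranges(total_pages, chunk_size):
    """Return inclusive, zero-based (first, last) page index pairs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src_path, out_dir, chunk_size=50):
    """Write src_path as several smaller PDFs and return their paths."""
    import fitz  # PyMuPDF; imported here so page_ranges works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(src_path).stem
    written = []
    with fitz.open(src_path) as src:
        ranges = page_ranges(src.page_count, chunk_size)
        for i, (first, last) in enumerate(ranges, start=1):
            part = fitz.open()  # new, empty PDF document
            part.insert_pdf(src, from_page=first, to_page=last)
            target = out_dir / f"{stem}_part{i:02d}.pdf"
            part.save(str(target))
            part.close()
            written.append(target)
    return written
```

For the 316-page file in this report, `split_pdf("shang.pdf", "parts")` would write seven parts, the last covering pages 301 to 316, and each part can then be processed on its own. A smaller `chunk_size` lowers peak memory further at the cost of more runs.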

laulguo commented 2 weeks ago

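> This looks like an out-of-memory error. You can split a PDF with too many pages into several smaller files and process them separately, or add more RAM to the machine.

Yes, that is exactly how I handled it. Separately, how can the scanning (OCR) parameters be adjusted? I am seeing a lot of cases like this: [image] In many of them the last line is left out.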

myhloli commented 2 weeks ago

The OCR missing-line issue has been fixed on the dev branch. If you need the fix, clone the dev branch, install from source, and try again.

laulguo commented 2 weeks ago

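> The OCR missing-line issue has been fixed on the dev branch. If you need the fix, clone the dev branch, install from source, and try again.

To install it, can I simply overwrite "magic_pdf"?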

laulguo commented 1 week ago

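> The OCR missing-line issue has been fixed on the dev branch. If you need the fix, clone the dev branch, install from source, and try again.

> To install it, can I simply overwrite "magic_pdf"?

Use this: python setup.py install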