Open laulguo opened 2 weeks ago
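This looks like an out-of-memory condition. You can split a PDF that has too many pages into several smaller files and process them separately, or add more memory to the machine.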
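The splitting suggestion above could be sketched roughly as follows. This is only an illustration, not something MinerU provides: it assumes the third-party `pypdf` package is installed, and the helper names `chunk_page_ranges` and `split_pdf`, the default chunk size of 50 pages, and the output file naming are all made up for this example.

```python
from pathlib import Path


def chunk_page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) page index ranges that cover the document."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src: str, out_dir: str, chunk_size: int = 50) -> list[Path]:
    """Write src as several smaller PDFs of at most chunk_size pages each."""
    # Assumption: pypdf is available. It is imported lazily so the range
    # helper above can be used (and tested) without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for start, stop in chunk_page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <stem>_<first page>-<last page>.pdf (1-based).
        target = target_dir / f"{Path(src).stem}_{start + 1:04d}-{stop:04d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

For the 316-page file from this issue, `split_pdf("shang.pdf", "parts")` would produce seven parts (six of 50 pages and one of 16), and each part can then be passed to `magic-pdf` on its own. The right chunk size depends on how much memory is available, so 50 is just a starting point.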
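Yes, that is exactly what I did. Separately, how do I adjust the scan parameters? I am seeing a lot of cases like this, where the last line is frequently left out.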
The OCR missing-line problem has been fixed on the dev branch. If you need the fix, clone the dev branch, install from source, and try again.
Is the installation just a matter of overwriting “magic_pdf”?
Use python setup.py install.
Description of the bug
Unable to allocate 41.9 MiB for an array with shape (6, 276, 6625) and data type float32
How to reproduce the bug
I hit this bug while processing a 316-page PDF. Is the file perhaps too large?
2024-08-25 23:55:25.082 | ERROR | magic_pdf.tools.cli:parse_doc:69 - Unable to allocate 41.9 MiB for an array with shape (6, 276, 6625) and data type float32
Traceback (most recent call last):
  File "D:\anaconda3\envs\MinerU\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\anaconda3\envs\MinerU\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\anaconda3\envs\MinerU\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
                                           └ {'path': 'shang.pdf', 'output_dir': '', 'method': 'auto'}
  File "D:\anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\tools\cli.py", line 75, in cli
    parse_doc(path)
              └ 'shang.pdf'
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\tools\common.py", line 65, in do_parse
    pipe.pipe_analyze()
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 31, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=True)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 120, in doc_analyze
    result = custom_model(img)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\pdf_extract_kit.py", line 250, in __call__
    ocr_res = self.ocr_model.ocr(new_image, mfd_res=adjusted_mfdetrec_res)[0]
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\pek_sub_modules\self_modify.py", line 209, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
                                              │            └ []
                                              └ True
  File "D:\anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\pek_sub_modules\self_modify.py", line 289, in __call__
    rec_res, elapse = self.text_recognizer(img_crop_list)
  File "D:\anaconda3\envs\MinerU\lib\site-packages\paddleocr\tools\infer\predict_rec.py", line 619, in __call__
    output = output_tensor.copy_to_cpu()
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 41.9 MiB for an array with shape (6, 276, 6625) and data type float32
(MinerU) D:\MinerU>
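As a quick sanity check on the number in the error message, the size of the array that failed to allocate can be recomputed from its shape and dtype. The helper name `array_nbytes` is made up for this sketch; the only inputs are the shape `(6, 276, 6625)` and the 4-byte size of `float32` from the message above.

```python
import math


def array_nbytes(shape: tuple[int, ...], itemsize: int) -> int:
    """Bytes needed for a dense array: number of elements times bytes per element."""
    return math.prod(shape) * itemsize


# float32 uses 4 bytes per element.
nbytes = array_nbytes((6, 276, 6625), 4)
print(nbytes)                       # 43884000
print(f"{nbytes / 2**20:.1f} MiB")  # 41.9 MiB
```

The request itself is only about 42 MiB, which suggests (though it does not prove) that the process had already used up nearly all available memory by the time this page was recognized, rather than this single array being unusually large. Since the failing call is `copy_to_cpu()`, the allocation is presumably for host RAM rather than GPU memory, even with the device mode set to cuda.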
Operating system
Windows
Python version
3.10
Software version (magic-pdf --version)
0.7.x
Device mode
cuda