opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
18.83k stars 1.35k forks

Memory leak causes the process to be killed #1070

Closed Cokejia closed 1 day ago

Cokejia commented 5 days ago

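Description of the bug

When processing a 379-page PDF, the model keeps consuming memory and is killed by the OOM killer once usage reaches a certain level (on my local machine memory usage also keeps growing until a MemoryError is raised). magic-pdf --version reports 0.10.0. Memory usage during the run: (image). GPU memory usage: (image).

Any guidance would be greatly appreciated. This is urgent, thank you!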

How to reproduce the bug

Log file contents:
WARNING: OMP_NUM_THREADS set to 12, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
PLEASE USE OMP_NUM_THREADS WISELY.
import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
2024-11-23 20:56:07.879 INFO magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 10, cid_chars_radio: 0.0
2024-11-23 20:56:07.880 WARNING magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: True, by_invalid_chars: True
2024-11-23 20:56:07.882 INFO magic_pdf.model.pdf_extract_kit:init:78 - DocAnalysis init, this may take some times, layout_model: layoutlmv3, apply_formula: True, apply_ocr: True, apply_table: False, table_model: rapid_table, lang: None
2024-11-23 20:56:07.882 INFO magic_pdf.model.pdf_extract_kit:init:91 - using device: cuda
2024-11-23 20:56:07.882 INFO magic_pdf.model.pdf_extract_kit:init:95 - using models_dir: /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit-1___0/models
CustomVisionEncoderDecoderModel init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
CustomMBartForCausalLM init
CustomMBartDecoder init
[11/23 20:56:18 detectron2]: Rank of current process: 0. World size: 1
[11/23 20:56:18 detectron2]: Environment info:

sys.platform                      linux
Python                            3.10.15 (main, Oct 3 2024, 07:27:34) [GCC 11.2.0]
numpy                             1.26.4
detectron2                        0.6 @/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/detectron2
Compiler                          GCC 11.4
CUDA compiler                     not available
DETECTRON2_ENV_MODULE
PyTorch                           2.3.1+cu121 @/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/torch
PyTorch debug build               False
torch._C._GLIBCXX_USE_CXX11_ABI   False
GPU available                     Yes
GPU 0                             NVIDIA GeForce RTX 3080 (arch=8.6)
Driver version                    550.120
CUDA_HOME                         /usr/local/cuda
Pillow                            11.0.0
torchvision                       0.18.1+cu121 @/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/torchvision
torchvision arch flags            5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 9.0
fvcore                            0.1.5.post20221221
iopath                            0.1.9
cv2                               4.6.0


PyTorch built with:

[11/23 20:56:18 detectron2]: Command line arguments: {'config_file': '/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit-1___0/models/Layout/LayoutLMv3/model_final.pth']}
[11/23 20:56:18 detectron2]: Contents of args.config_file=/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:
AUG:
  DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: false
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[11/23 20:56:21 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit-1___0/models/Layout/LayoutLMv3/model_final.pth ...
[11/23 20:56:21 fvcore.common.checkpoint]: [Checkpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit-1___0/models/Layout/LayoutLMv3/model_final.pth ...
2024-11-23 20:56:23.160 | INFO | magic_pdf.model.pdf_extract_kit:init:170 - DocAnalysis init done!
2024-11-23 20:56:23.161 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:131 - model init cost: 15.280268430709839
Killed

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

Cokejia commented 5 days ago

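Memory leak: once usage reaches a certain level, the process is killed: (image) (image)
Configuration of the server used for the run above:
Container image: PyTorch 2.1.2, Python 3.10 (ubuntu22.04), Cuda 11.8
GPU: RTX 3090 (24GB) * 1
CPU: 12 vCPU Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Memory: 72GB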

myhloli commented 5 days ago

Have you installed the GPU build of Paddle (paddlepaddle-gpu)?

Cokejia commented 4 days ago

Have you installed the GPU build of Paddle (paddlepaddle-gpu)?

Yes, it is installed.

myhloli commented 4 days ago

Could you upload a copy of the problematic file so we can retest it on our side?

Cokejia commented 2 days ago

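Could you upload a copy of the problematic file so we can retest it on our side?

I have sent it to your email address, moe@myhloli.com; please check your inbox. When you run it, please keep a close eye on memory usage.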
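For anyone wanting to watch memory from inside a Python process while reproducing this, here is a minimal sketch using only the standard library. The helper names `peak_rss_mib` and `log_peak` are made up for illustration and are not part of magic-pdf. It assumes a Unix-like system, and it only reports the process that calls it, so it would have to run inside the Python process doing the conversion rather than next to the CLI.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak(label):
    """Print and return a one-line peak-memory report (hypothetical helper)."""
    line = f"[mem] {label}: peak RSS {peak_rss_mib():.1f} MiB"
    print(line)
    return line


log_peak("after startup")
```

Calling `log_peak` after each batch of pages would show whether the peak keeps climbing as more pages are processed.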

myhloli commented 2 days ago

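I tried it. The resolution of your scanned PDF is too high, and rendering the pages to images exhausted memory. We will adjust the logic so that PDFs with excessively high resolution skip the scaling step, which should be enough to fix it.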
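The idea described above can be sketched roughly as follows. This is an illustration under assumptions, not MinerU's actual code: the function names, the 200 DPI target, and the 9000-pixel limit are all hypothetical values chosen for the example. The point is that a page is normally scaled up by `dpi / 72` before rasterizing, and that an already huge page should fall back to a scale of 1.0.

```python
def choose_render_scale(width_pt, height_pt, dpi=200, max_side_px=9000):
    """Pick a render scale for a page whose size is given in PDF points.

    dpi and max_side_px are illustrative assumptions, not MinerU settings.
    """
    # PDF points are 1/72 inch, so dpi / 72 converts points to pixels.
    width_px = round(width_pt * dpi / 72)
    height_px = round(height_pt * dpi / 72)
    if width_px > max_side_px or height_px > max_side_px:
        # The page is already very large: skip the upscaling step.
        return 1.0
    return dpi / 72


def estimate_rgb_bytes(width_pt, height_pt, scale):
    """Rough size of one uncompressed RGB bitmap (3 bytes per pixel)."""
    return round(width_pt * scale) * round(height_pt * scale) * 3


# A US Letter page keeps the normal scale; an oversized scan does not.
for size in [(612, 792), (5000, 7000)]:
    scale = choose_render_scale(*size)
    mib = estimate_rgb_bytes(*size, scale) / 2**20
    print(size, f"scale={scale:.2f}", f"~{mib:.0f} MiB per page")
```

If the pages are rendered with PyMuPDF, the returned factor would typically be passed as `fitz.Matrix(scale, scale)` to `page.get_pixmap(matrix=...)`. The estimate counts only the raw bitmap, so real usage would be higher once copies and model inputs are included.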

Cokejia commented 2 days ago

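I tried it. The resolution of your scanned PDF is too high, and rendering the pages to images exhausted memory. We will adjust the logic so that PDFs with excessively high resolution skip the scaling step, which should be enough to fix it.

OK, thanks. How exactly is the scaling skipped? How do I change it in the code?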

myhloli commented 2 days ago

https://github.com/opendatalab/MinerU/pull/1106/files

Cokejia commented 2 days ago

https://github.com/opendatalab/MinerU/pull/1106/files

OK, thank you. But I invoke the model directly from the command line. In which local directory is the source code stored? (image)

myhloli commented 2 days ago

In your conda installation directory.
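One way to locate the installed source is to ask Python where it imports the package from. The sketch below uses only the standard library; `package_dir` is a hypothetical helper name, and it has to be run with the interpreter of the conda environment in which magic-pdf is installed. Judging by the log above, the result would be somewhere under `/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/`.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory a top-level package is imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        # Not installed in this environment (or a namespace package).
        return None
    return Path(spec.origin).parent


# Prints None if magic_pdf is not installed in the active environment.
print(package_dir("magic_pdf"))
```

Edits made in that directory affect the `magic-pdf` command directly, so it may be worth keeping a backup of any file before changing it.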

Cokejia commented 2 days ago

In your conda installation directory.

OK, found it. Thank you very much!