微调minicpm-v2_6模型OOM报错

环境
flash-attn @ file:///ssd2/flash_attn-2.6.3%2Bcu118torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl#sha256=b9e51701e981d3c8df0988174a76b8e865027daea2c006e609c39f0fbeba7a2e
torch==2.1.2+cu118
torchaudio==2.1.2+cu118
torchvision==0.16.2+cu118
cuda==11.8
机器==A800，显存80G

训练脚本和日志
run sh: `torchrun --nproc_per_node 4 --master_port 29501 /ssd2//swift-mine/swift/cli/sft.py --model_type minicpm-v-v2_6-chat --model_id_or_path /ssd2/MLLM/models/MiniCPM-V-2_6 --sft_type lora --target_regex llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj|o_proj) --tuner_backend peft --modules_to_save embed_tokens resampler vpm --template_type AUTO --dtype AUTO --output_dir output/test --dataset custom_video --train_dataset_sample -1 --num_train_epochs 1 --max_length 4096 --check_dataset_strategy warning --lora_rank 64 --lora_alpha 64 --lora_dropout_p 0.05 --gradient_checkpointing true --batch_size 1 --learning_rate 2e-6 --gradient_accumulation_steps 16 --max_grad_norm 0.5 --warmup_ratio 0.01 --weight_decay 0.1 --save_total_limit 1 --save_strategy epoch --logging_steps 1 --dataset_test_ratio 0 --use_flash_attn false --deepspeed default-zero2`

[INFO:swift] target_modules: ^(llm|resampler)(?!.*(lm_head|output|emb|wte|shared)).*
[INFO:swift] modules_to_save: ['embed_tokens', 'resampler', 'vpm']
[INFO:swift] Value of target_modules: ^(llm|resampler)(?!.*(lm_head|output|emb|wte|shared)).* will have no effect because target_regex value: llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj|o_proj) exists.
Map:   0%|                                                                                                                                                                                  | 0/146200 [00:00<?, ? examples/s][INFO:swift] lora_config: get_wrapped_class.<locals>.PeftWrapper(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='/ssd2//MLLM/models/MiniCPM-V-2_6', revision=None, task_type='CAUSAL_LM', inference_mode=False, r=64, target_modules='llm\\..*layers\\.\\d+\\.self_attn\\.(q_proj|k_proj|v_proj|o_proj)', lora_alpha=64, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=['embed_tokens', 'resampler', 'vpm'], init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_dtype=None, lorap_lr_ratio=None, lorap_emb_lr=1e-06)
[INFO:swift] PeftModelForCausalLM: 9169.5278M Params (1070.3526M Trainable [11.6729%]), 270.0060M Buffers.
[INFO:swift] Setting model.config.use_cache: False

报错信息
Traceback (most recent call last):
  File "/ssd2//MLLM/swift-mine/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/ssd2/MLLM/swift-mine/swift/utils/run_utils.py", line 32, in x_main
    result = llm_x(args, **kwargs)
  File "/ssd2/MLLM/swift-mine/swift/llm/sft.py", line 342, in llm_sft
    td0, tkwargs0 = template.encode(train_dataset[0])
TypeError: cannot unpack non-iterable NoneType object
[INFO:swift] ['Traceback (most recent call last):\n', '  File "/ssd2//MLLM/swift-mine/swift/llm/utils/template.py", line 459, in encode\n    res = _encode(example) if not streaming else _encode(example)[0]\n', '  File "/ssd2//MLLM/swift-mine/swift/llm/utils/template.py", line 2714, in _encode\n    inputs_embeds, _ = self.model.get_vllm_embedding(data)\n', '  File "/home//.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_minicpmv.py", line 107, in get_vllm_embedding\n    vision_embedding = self.vpm(all_pixel_values.type(dtype), patch_attention_mask=patch_attn_mask, tgt_sizes=tgt_sizes).last_hidden_state\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl\n    return forward_call(*args, **kwargs)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/peft/utils/other.py", line 264, in forward\n    return self.modules_to_save[self.active_adapter](*args, **kwargs)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl\n    return forward_call(*args, **kwargs)\n', '  File "/home//.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_navit_siglip.py", line 918, in forward\n    encoder_outputs = self.encoder(\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl\n    return forward_call(*args, **kwargs)\n', '  File "/home//.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_navit_siglip.py", line 826, in forward\n    layer_outputs = encoder_layer(\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl\n    return forward_call(*args, **kwargs)\n', '  File "/home//.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_navit_siglip.py", line 670, in forward\n    hidden_states, attn_weights = self.self_attn(\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl\n    return forward_call(*args, **kwargs)\n', '  File "/home//.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_navit_siglip.py", line 415, in forward\n    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)\n', '  File "/ssd2//MLLM/swift-mine/env39/lib/python3.9/site-packages/torch/nn/functional.py", line 1858, in softmax\n    ret = input.softmax(dim, dtype=dtype)\n', 'torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB. GPU 0 has a total capacty of 79.33 GiB of which 995.81 MiB is free. Including non-PyTorch memory, this process has 78.34 GiB memory in use. Of the allocated memory 74.97 GiB is allocated by PyTorch, and 2.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF\n']

但是当--modules_to_save embed_tokens resampler，训练参数量为652M，也就比上面少了400M，bs=4训练的时候显存只用了40-50G，当加入vpm的时候bs=1显存都超了是为什么呢
modelscope / ms-swift

微调minicpm-v2_6模型OOM报错 #1710