modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

GLM4v: resuming training from a checkpoint fails after LoRA fine-tuning #1133

Closed: lyc728 closed this issue 2 months ago

lyc728 commented 4 months ago

Training script

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path /MLLM/new_models/ZhipuAI/glm-4v-9b \
    --dataset /data_archive/LVLM_label/KIE4GLM4V_train.json \
    --ddp_find_unused_parameters true \
    --output_dir /data/swift/us_desc/ \
    --batch_size 3 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --num_train_epochs 5 \
    --save_strategy "steps" \
    --save_steps 300 \
    --sft_type "lora" \
    --save_total_limit 2 \
    --save_only_model false \
    --resume_from_checkpoint /glm4v-9b-chat/v0-20240613-113744/checkpoint-4200

Error message

  File "/data/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/data/swift/swift/llm/sft.py", line 301, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/data/swift/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2902, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/swift/swift/trainers/trainers.py", line 188, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 687, in forward
    return model_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 675, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/opt/conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py", line 1012, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py", line 901, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py", line 658, in forward
    layer_ret = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py", line 559, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4v-9b/modeling_chatglm.py", line 206, in forward
    return (self.weight * hidden_states).to(input_dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 12.19 MiB is free. Process 11096 has 0 bytes memory in use. Including non-PyTorch memory, this process has 0 bytes memory in use. Process 11094 has 0 bytes memory in use. Process 11095 has 0 bytes memory in use. Of the allocated memory 35.51 GiB is allocated by PyTorch, and 1.34 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Jintao-Huang commented 4 months ago

That is an OOM. Set batch_size=1 and then increase gradient_accumulation_steps.
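
A minimal sketch of the adjusted launch under that advice, reusing the paths from the script above; gradient_accumulation_steps is raised to 24 here so the effective batch size (per-device batch size x accumulation steps) stays roughly the same as the original 3 x 8:

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path /MLLM/new_models/ZhipuAI/glm-4v-9b \
    --batch_size 1 \
    --gradient_accumulation_steps 24
    # ... keep the remaining arguments from the original script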

Jintao-Huang commented 4 months ago

There is also a parameter worth knowing about at the same time: --resume_only_model https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0.md
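
A rough sketch of combining it with --resume_from_checkpoint; per the linked command-line parameter docs, --resume_only_model restores only the model weights from the checkpoint rather than the full trainer state, so confirm the exact behavior against that page:

swift sft \
    --model_type glm4v-9b-chat \
    --resume_from_checkpoint /glm4v-9b-chat/v0-20240613-113744/checkpoint-4200 \
    --resume_only_model true
    # ... other arguments as in the original training script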

lyc728 commented 4 months ago

There is still a question: why is GPU memory usage uneven across the cards during training?

[Screenshot: per-GPU memory usage]

Another question: with this script

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \

every card loads the model and the GPUs run out of memory.

Whereas with this script

NPROC_PER_NODE=4 \

MASTER_PORT=29521

CUDA_VISIBLE_DEVICES=0,1,2,3 \

the model is loaded onto the cards in order, but memory usage is uneven across the GPUs during training and eventually it also runs out of memory.

Jintao-Huang commented 4 months ago

In principle, if NPROC_PER_NODE is not set, the device_map approach is used: the model is split layer-wise and distributed across the cards, so memory usage per card is uneven. If it is set, DDP / ZeRO-2 / ZeRO-3 is used.
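
As a minimal sketch of the two modes described above (only the model_type from the original script is kept, everything else elided):

# device_map mode: no NPROC_PER_NODE, a single process shards the model
# layer-wise across the visible GPUs, so per-card memory is uneven
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft --model_type glm4v-9b-chat --sft_type lora ...

# DDP mode: NPROC_PER_NODE launches one process per GPU, each holding a
# full copy of the model (further sharding only with ZeRO-2/ZeRO-3)
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft --model_type glm4v-9b-chat --sft_type lora ...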

lyc728 commented 4 months ago

In principle, if NPROC_PER_NODE is not set, the device_map approach is used: the model is split layer-wise and distributed across the cards, so memory usage per card is uneven. If it is set, DDP / ZeRO-2 / ZeRO-3 is used.

Could you test this on your side? The two commands are almost identical, differing by only one line, yet the GPU memory usage is different.

lyc728 commented 4 months ago

Two problems have now shown up during training. First, during LoRA fine-tuning the loss barely changes, and the inference results do not follow the training format.

[Screenshots attached]

Second, with full-parameter fine-tuning, testing shows the metrics are no higher than the inference metrics of the general (un-finetuned) model.

Jintao-Huang commented 4 months ago

Yes, there was a bug before; try pulling the latest code again.
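
A minimal sketch of updating a source checkout, assuming swift was installed from a cloned repository in editable mode (the path below is taken from the traceback above):

cd /data/swift      # local clone of the swift repo
git pull            # fetch the latest code containing the fix
pip install -e .    # reinstall in editable mode so the update takes effect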

lyc728 commented 4 months ago

Yes, there was a bug before; try pulling the latest code again.

Does this fix the resume-from-checkpoint issue, or the issue with the LoRA fine-tuning loss not changing?