liuqidong07 / LEADER-pytorch

[arXiv'24] The official implementation code of LEADER.
https://arxiv.org/abs/2402.02803
MIT License

Problem with model testing #4

Closed: cwwhh closed this issue 3 months ago

cwwhh commented 5 months ago

Hello author, after reading your paper I tried to reproduce the experiments and ran into some problems; I hope you can help. Because our memory is limited, we trained the model with the ZeRO-3 strategy, and then hit the following error during the testing stage:

  train()
  File "main_llm_cls.py", line 78, in train
    model = PeftModelForCLS.from_pretrained(model, model_args.peft_path, is_trainable=False)
  File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 94, in from_pretrained
    model.load_adapter(model_id, adapter_name, **kwargs)
  File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 130, in load_adapter
    set_peft_model_state_dict(self, adapters_weights, adapter_name=adapter_name)
  File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 282, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCLS:
        size mismatch for base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
        size mismatch for base_model.model.model.layers.1.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.1.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.1.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
        size mismatch for base_model.model.model.layers.2.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.2.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).

The model files saved after training are shown in the attached screenshot (checkpoint directory).

liuqidong07 commented 5 months ago

Hello, and thank you for your interest in our work. Since I have not tried ZeRO-3 training myself, I cannot give an exact solution. Judging from the error, though, something probably went wrong when the adapter was saved after training. I suggest setting --save_steps to 1 and then debugging the checkpoint-saving code, in particular the set_peft_model_state_dict function in peft/utils/save_and_load. Best wishes.
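
A minimal diagnostic sketch along these lines (not from the repo; the checkpoint path and the adapter_model.bin file name are assumptions based on PEFT's usual adapter output) is to load the saved adapter weights directly and list any tensors that were written out empty:

```python
# Minimal diagnostic sketch (not from the repo): list any LoRA tensors that
# were saved with zero elements. The path and the file name "adapter_model.bin"
# are assumptions based on PEFT's usual adapter output; adjust them to the
# actual checkpoint directory.
import torch

adapter_weights = torch.load("checkpoint-xxx/adapter_model.bin", map_location="cpu")
for name, tensor in adapter_weights.items():
    if tensor.numel() == 0:
        print(f"EMPTY: {name} saved with shape {tuple(tensor.shape)}")
```

If the lora_A/lora_B weights show up here with shape (0,), the problem is on the saving side (the ZeRO-3 partitioned parameters were never gathered before being written out), not in the loading code.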

cwwhh commented 5 months ago

Thank you for the help, but after changing save_steps I still get the same error. May I ask how many GB of memory you used? Or could you upload the trained checkpoint?

liuqidong07 commented 5 months ago

1. Hello. The checkpoint is quite large, so it is not very convenient to upload; I will try to find a way.
2. Also, changing save_steps alone will not directly fix this. You need to debug the checkpoint-saving code, because the error shows that the saved LoRA parameters have dimension 0. I suggest first getting the basic LoRA pipeline running correctly.
3. I assume you are asking about GPU memory during training? We trained on 4 V100 32GB GPUs. Best wishes.
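
Following up on point 2 above, here is a hedged sketch of one possible ZeRO-3 workaround (not something the authors report using): gather the partitioned parameters before extracting the LoRA state dict, so the saved tensors keep their full shapes instead of [0]. The helper name save_lora_adapter_zero3 is hypothetical.

```python
# Hedged sketch of one possible ZeRO-3 workaround (not part of the LEADER code):
# gather the ZeRO-3 partitioned parameters so they are fully materialized inside
# the context, then save the LoRA state dict from rank 0 only. Under ZeRO-3 each
# rank normally holds just a shard of every parameter, which is why an unguarded
# state_dict() can yield shape-[0] tensors. Assumes a DeepSpeed run with
# torch.distributed already initialized and that `model` is the unwrapped PEFT
# model (use engine.module if it is wrapped by a DeepSpeed engine).
import deepspeed
import torch
from peft import get_peft_model_state_dict


def save_lora_adapter_zero3(model, output_path):
    params = list(model.parameters())
    with deepspeed.zero.GatheredParameters(params, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            lora_state_dict = get_peft_model_state_dict(model)
            torch.save(lora_state_dict, output_path)
```

Depending on how the training script saves checkpoints, setting "stage3_gather_16bit_weights_on_model_save": true in the ZeRO-3 section of the DeepSpeed config may also help, since it tells DeepSpeed to consolidate the 16-bit weights when the model is saved.
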
cwwhh commented 5 months ago

Okay, thank you.