liuqidong07 / LEADER-pytorch

[arXiv'24] The official implementation code of LEADER.
https://arxiv.org/abs/2402.02803
MIT License

Problem with model testing #4

Closed: cwwhh closed this issue 3 months ago

cwwhh commented 5 months ago

Hello author, after reading your paper I tried to reproduce the experiments and ran into some problems; I hope you can help. Because our memory is limited, we trained the model with the ZeRO-3 strategy, and then hit the following error during the testing stage:

  train()
  File "main_llm_cls.py", line 78, in train
    model = PeftModelForCLS.from_pretrained(model, model_args.peft_path, is_trainable=False)
  File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 94, in from_pretrained
    model.load_adapter(model_id, adapter_name, **kwargs)
  File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 130, in load_adapter
    set_peft_model_state_dict(self, adapters_weights, adapter_name=adapter_name)
  File "/root/autodl-tmp/LEADER/llm/lora_cls.py", line 282, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCLS:
        size mismatch for base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
        size mismatch for base_model.model.model.layers.1.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.1.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.1.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8, 11008]).
        size mismatch for base_model.model.model.layers.2.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).
        size mismatch for base_model.model.model.layers.2.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([11008, 8]).

The model files saved after training are shown in the attached screenshot (checkpoint directory).

liuqidong07 commented 5 months ago

Hello, and thank you for your interest in our work. Since I have not tried ZeRO-3 training myself, I cannot give an exact solution. Judging from the error, though, something probably went wrong when the adapter was saved after training. I suggest setting --save_steps to 1 and then debugging the checkpoint-saving code, in particular the set_peft_model_state_dict function in peft/utils/save_and_load. Best wishes.
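
A minimal diagnostic sketch along these lines (not from the repo; the checkpoint path and the adapter_model.bin file name are assumptions based on PEFT's usual adapter output) is to load the saved adapter weights directly and list any tensors that were written out empty:

```python
# Minimal diagnostic sketch (not from the repo): list any LoRA tensors that
# were saved with zero elements. The path and the file name "adapter_model.bin"
# are assumptions based on PEFT's usual adapter output; adjust them to the
# actual checkpoint directory.
import torch

adapter_weights = torch.load("checkpoint-xxx/adapter_model.bin", map_location="cpu")
for name, tensor in adapter_weights.items():
    if tensor.numel() == 0:
        print(f"EMPTY: {name} saved with shape {tuple(tensor.shape)}")
```

If the lora_A/lora_B weights show up here with shape (0,), the problem is on the saving side (the ZeRO-3 partitioned parameters were never gathered before being written out), not in the loading code.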

cwwhh commented 5 months ago

Thank you for the help, but after changing save_steps I still get the same error. May I ask how many GB of memory you used? Or could you upload the trained checkpoint?

liuqidong07 commented 5 months ago

1. Hello. The checkpoint is quite large, so it is not very convenient to upload; I will try to find a way.
2. Also, changing save_steps alone will not directly fix this. You need to debug the checkpoint-saving code, because the error shows that the saved LoRA parameters have dimension 0. I suggest first getting the basic LoRA pipeline running correctly.
3. I assume you are asking about GPU memory during training? We trained on 4 V100 32GB GPUs. Best wishes.
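
Following up on point 2 above, here is a hedged sketch of one possible ZeRO-3 workaround (not something the authors report using): gather the partitioned parameters before extracting the LoRA state dict, so the saved tensors keep their full shapes instead of [0]. The helper name save_lora_adapter_zero3 is hypothetical.

```python
# Hedged sketch of one possible ZeRO-3 workaround (not part of the LEADER code):
# gather the ZeRO-3 partitioned parameters so they are fully materialized inside
# the context, then save the LoRA state dict from rank 0 only. Under ZeRO-3 each
# rank normally holds just a shard of every parameter, which is why an unguarded
# state_dict() can yield shape-[0] tensors. Assumes a DeepSpeed run with
# torch.distributed already initialized and that `model` is the unwrapped PEFT
# model (use engine.module if it is wrapped by a DeepSpeed engine).
import deepspeed
import torch
from peft import get_peft_model_state_dict


def save_lora_adapter_zero3(model, output_path):
    params = list(model.parameters())
    with deepspeed.zero.GatheredParameters(params, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            lora_state_dict = get_peft_model_state_dict(model)
            torch.save(lora_state_dict, output_path)
```

Depending on how the training script saves checkpoints, setting "stage3_gather_16bit_weights_on_model_save": true in the ZeRO-3 section of the DeepSpeed config may also help, since it tells DeepSpeed to consolidate the 16-bit weights when the model is saved.
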
cwwhh commented 5 months ago

Okay, thank you.