Describe the bug
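Setup: deepspeed-zero3, lora_target_modules ALL, model_type phi3-vision-128k-instruct, multi-node multi-GPU. When resuming from a checkpoint, the model apparently cannot be loaded. Note that the checkpoint folder contains only the LoRA-related parameters, yet the error shows the model trying to load more parameters than that.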
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
load_path, _ = deepspeed_engine.load_checkpoint(
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2724, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2794, in _load_checkpoint
self.load_module_state_dict(checkpoint=checkpoint,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2587, in load_module_state_dict
self.module.load_state_dict(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.vision_embed_tokens.glb_GN", "base_model.model.model.vision_embed_tokens.sub_GN", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.embeddings.class_embedding", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.embeddings.patch_embedding.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.embeddings.position_embedding.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.bias", 
"base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc1.base_layer.weight", ... (remaining keys omitted)
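To illustrate why a strict state-dict load reports these keys as missing, here is a minimal sketch in plain Python. It does not call torch, PEFT, or DeepSpeed. The helper names `diff_state_dict_keys` and `is_lora_key` are hypothetical, the `lora_A`/`lora_B` key names are illustrative assumptions, and the shortened key paths are not the real ones. It only mimics the key comparison: every key the model expects but the checkpoint lacks is reported as missing.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Compare key names the way a strict state-dict load would.

    Returns (missing, unexpected):
      missing    -> keys the model expects but the checkpoint lacks
      unexpected -> keys the checkpoint has but the model does not expect
    """
    model_set = set(model_keys)
    ckpt_set = set(checkpoint_keys)
    missing = [k for k in model_keys if k not in ckpt_set]
    unexpected = [k for k in checkpoint_keys if k not in model_set]
    return missing, unexpected


def is_lora_key(key):
    # Assumption: adapter tensors carry "lora_" somewhere in their name.
    return "lora_" in key


# Illustrative key names only. The first two are taken from the error above;
# the shortened "layers.0..." paths and the lora_A / lora_B names are assumed.
model_keys = [
    "base_model.model.model.embed_tokens.weight",
    "base_model.model.model.vision_embed_tokens.glb_GN",
    "layers.0.self_attn.k_proj.base_layer.weight",
    "layers.0.self_attn.k_proj.lora_A.default.weight",
    "layers.0.self_attn.k_proj.lora_B.default.weight",
]

# A checkpoint that stores only the LoRA-related parameters.
checkpoint_keys = [k for k in model_keys if is_lora_key(k)]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)

for key in missing:
    print("missing:", key)
```

In this sketch every key reported as missing is a non-LoRA (base model) key, which matches the pattern in the error message: the checkpoint holds only adapter weights while the strict load expects the full `PeftModelForCausalLM` state dict.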