multimodal-art-projection / MAP-NEO

error in batch_convert_ckpt #32

Open hanjr92 opened 1 week ago

hanjr92 commented 1 week ago

When I run bash neo/scripts/batch_convert_ckpt.sh, I get the following error:

received transformer layer 17
received final norm
received output layer
Saving model to disk ...
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/megatron/tools/checkpoint/saver_llama2_hf_bf.py", line 108, in save_checkpoint
    model = AutoModelForCausalLM.from_pretrained(None, config=llama_conf, state_dict=state_dict, torch_dtype=torch_dtype)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3278, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.2.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.2.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.3.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.3.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.4.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.4.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
    size mismatch for model.layers.5.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
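
For reference, the mismatched shapes look like a grouped-query-attention (GQA) mismatch: a k_proj/v_proj weight of [256, 2048] is what you get when num_key_value_heads is smaller than num_attention_heads, while the [2048, 2048] the saver expects corresponds to full multi-head attention. Here is a minimal sketch of the shape arithmetic with transformers' LlamaConfig (the head counts below are assumptions chosen only to reproduce the traceback, not the actual NEO 2B hyperparameters):

```python
# Shape sanity check (assumed head counts, not the real NEO 2B config):
# in a Llama-style model, k_proj/v_proj map hidden_size -> num_key_value_heads * head_dim.
from transformers import LlamaConfig

def kv_proj_shape(conf):
    head_dim = conf.hidden_size // conf.num_attention_heads
    return (conf.num_key_value_heads * head_dim, conf.hidden_size)

# Config the saver seems to build: no GQA, so k_proj is [2048, 2048]
mha = LlamaConfig(hidden_size=2048, num_attention_heads=16, num_key_value_heads=16)
# Config matching the checkpoint tensors: 256 = 2 KV heads * 128 head_dim
gqa = LlamaConfig(hidden_size=2048, num_attention_heads=16, num_key_value_heads=2)

print(kv_proj_shape(mha))  # (2048, 2048)
print(kv_proj_shape(gqa))  # (256, 2048)
```

If that is what is happening, the llama_conf built in saver_llama2_hf_bf.py is presumably missing (or has the wrong) num_key_value_heads for the 2B model.
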
hanjr92 commented 1 week ago

I hope you can provide 2B HF-version checkpoints.

Kevinstone-199898 commented 5 days ago

Excuse me, where did you get the checkpoint? From Hugging Face?

hanjr92 commented 4 days ago

Excuse me, where did you get the checkpoint? From Hugging Face?

Yes, I got the 2B checkpoints from Hugging Face. The checkpoints look like the Megatron version, so I hit this error when I used the conversion tools.
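
In case it helps narrow things down, here is a rough way to peek at the downloaded files and confirm whether they are Megatron-format (the path and shard name below are assumptions based on the usual Megatron layout, not the actual repo contents):

```python
# Hypothetical inspection script; adjust the path to wherever the download landed.
import torch

sd = torch.load("path/to/mp_rank_00/model_optim_rng.pt", map_location="cpu")

# A Megatron checkpoint nests weights under sd['model'] with Megatron-style names;
# an HF state_dict is flat, with keys like 'model.layers.0.self_attn.k_proj.weight'.
state = sd.get("model", sd)
for key, value in list(state.items())[:10]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(key, shape)
```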

Kevinstone-199898 commented 4 days ago

Did you just directly run bash neo/scripts/batch_convert_ckpt.sh without any modification and then hit this error? It seems that the loader runs correctly and the saver part is wrong.

hanjr92 commented 4 days ago

Did you just directly run bash neo/scripts/batch_convert_ckpt.sh without any modification and then hit this error? It seems that the loader runs correctly and the saver part is wrong.

Yes, I didn't modify any files. Maybe neo/scripts/batch_convert_ckpt.sh only works on the 7B model?
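
If the root cause is the GQA mismatch noted above, one thing that might be worth trying (purely a guess, not a verified fix) is making sure the LlamaConfig passed to from_pretrained in saver_llama2_hf_bf.py carries the 2B model's KV-head count, along the lines of:

```python
import torch
from transformers import AutoModelForCausalLM, LlamaConfig

def build_hf_model(state_dict, torch_dtype=torch.bfloat16):
    # Hypothetical config: all sizes below are assumptions, not the real NEO 2B
    # hyperparameters -- the saver would need to fill in the true values.
    llama_conf = LlamaConfig(
        hidden_size=2048,
        num_hidden_layers=18,     # the log ends at "transformer layer 17" (0-indexed)
        num_attention_heads=16,
        num_key_value_heads=2,    # the GQA field that appears to be missing/wrong
        vocab_size=32000,         # assumption
    )
    # Mirrors the failing call from the traceback, just with a GQA-aware config.
    return AutoModelForCausalLM.from_pretrained(
        None, config=llama_conf, state_dict=state_dict, torch_dtype=torch_dtype
    )
```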

Kevinstone-199898 commented 4 days ago

No, I tried to convert the 7B checkpoints, and the error occurred in the loader part.

hanjr92 commented 4 days ago

No, I tried to convert the 7B checkpoints, and the error occurred in the loader part.

OK, it looks like there are some bugs in it.