size mismatch error when finetuning

Hoteryoung commented 3 months ago

I ran into the following error:

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:11<00:11, 11.91s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  7.90s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.51s/it]
Traceback (most recent call last):
  File "/xxxxx/Documents/code/GeoChat/geochat/train/train_mem.py", line 13, in <module>
    train()
  File "/xxxxx/Documents/code/GeoChat/geochat/train/train.py", line 828, in train
    model = GeoChatLlamaForCausalLM.from_pretrained(
  File "/xxxxx/anaconda3/envs/geochat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/xxxxx/anaconda3/envs/geochat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3310, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for GeoChatLlamaForCausalLM:
    size mismatch for model.vision_tower.vision_tower.vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([577, 1024]) from checkpoint, the shape in current model is torch.Size([1297, 1024]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
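
For reference, the two shapes in the error follow from the usual ViT position-embedding arithmetic: one position per image patch plus one for the class token. The sketch below is only a sanity check under stated assumptions. The patch size of 14 is taken from the name `clip-vit-large-patch14-336`. The 504-pixel input size is inferred from 1297 and is not stated anywhere in this thread. Both helper names are hypothetical.

```python
import math


def num_positions(image_size: int, patch_size: int = 14) -> int:
    """Number of position embeddings: one per patch plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size
    return grid * grid + 1


def image_size_from_positions(positions: int, patch_size: int = 14) -> int:
    """Invert num_positions: recover the square input size in pixels."""
    grid = math.isqrt(positions - 1)
    if grid * grid != positions - 1:
        raise ValueError("positions - 1 is not a perfect square")
    return grid * patch_size


# Shape stored in the checkpoint: 336 px input, patch size 14.
print(num_positions(336))               # 577
# Shape expected by the current model: which input size gives 1297?
print(image_size_from_positions(1297))  # 504
```

If this reading is right, the checkpoint was saved for 336-pixel inputs while the model being built expects 504-pixel inputs, which would explain why the position embedding table is the only mismatched parameter.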

The script for finetuning:

srun --jobid $SLURM_JOBID \
    bash -c "python -m torch.distributed.run \
        --nproc_per_node $GPUS_PER_NODE \
        --nnodes $SLURM_NNODES \
        --node_rank $SLURM_PROCID \
        --master_addr $MASTER_ADDR \
        --master_port $MASTER_PORT \
        geochat/train/train_mem.py \
            --lora_enable True \
            --model_name_or_path $CODE_DIR/llava-v1.5-7b/ \
            --version $PROMPT_VERSION \
            --data_path $DATASET_DIR/GeoChat_Instruct.json \
            --image_folder $DATASET_DIR/share/softwares/kartik/GeoChat_finetuning/final_images_llava/  \
            --vision_tower openai/clip-vit-large-patch14-336/ \
            --mm_projector_type mlp2x_gelu \
            --pretrain_mm_mlp_adapter $CODE_DIR/llava-v1.5-7b/mm_projector.bin \
            --mm_vision_select_layer -2 \
            --mm_use_im_start_end False \
            --mm_use_im_patch_token False \
            --image_aspect_ratio pad \
            --bf16 True \
            --output_dir $OUTPUT_DIR \
            --num_train_epochs 1 \
            --per_device_train_batch_size 32 \
            --per_device_eval_batch_size 4 \
            --gradient_accumulation_steps 1 \
            --evaluation_strategy 'no' \
            --save_strategy 'epoch' \
            --save_steps 10000 \
            --save_total_limit 1 \
            --learning_rate 2e-4 \
            --weight_decay 0. \
            --warmup_ratio 0.03 \
            --lr_scheduler_type 'cosine' \
            --logging_steps 1 \
            --tf32 True \
            --model_max_length 2048 \
            --gradient_checkpointing True \
            --lazy_preprocess True \
            --dataloader_num_workers 16 \
            --report_to wandb \
            --deepspeed ./scripts/zero2.json"

Please note that I use the latest commit.

RogersSteve commented 1 month ago

Hello, I have the same problem. Do you know how to fix it now?

Hoteryoung commented 1 month ago

Are you sure your mismatched size is the same as mine? As I remember, I closed the issue because the error was caused by an incorrect change I had made somewhere in the original code. I really can't remember the details, though.

RogersSteve commented 1 month ago

I fixed this problem today; thanks for your answer. The cause was that I had used the wrong model for finetuning.
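
One way to catch a wrong base model before `from_pretrained` raises is to compare parameter shapes by name. This is a minimal sketch, not part of GeoChat: `find_shape_mismatches` is a hypothetical helper, and it works on plain name-to-shape dictionaries so it runs without PyTorch. With PyTorch, such a dictionary could be built from a state dict as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint: Dict[str, Shape], model: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for every parameter
    present in both mappings whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in checkpoint.items():
        model_shape = model.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Shapes copied from the error message at the top of this issue.
param = (
    "model.vision_tower.vision_tower.vision_model"
    ".embeddings.position_embedding.weight"
)
checkpoint_shapes = {param: (577, 1024)}
model_shapes = {param: (1297, 1024)}

for name, ckpt_shape, model_shape in find_shape_mismatches(
    checkpoint_shapes, model_shapes
):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```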

kartikey9254 commented 2 weeks ago

Can you help me do the same? When I use the GeoChat 7B model, it shows the size mismatch, and when I use llava-v1.5-7b, it says that files are not found in the directory. I have double-checked against Hugging Face that each file is present. I am unable to train on my data because of this. If my question is unclear, could you please explain how I can train on my data?
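
For the "files not found" case, it may help to check which files the local model directory actually contains before launching training. A minimal sketch under stated assumptions: `missing_files` is a hypothetical helper, the directory name is a placeholder, and the file list is illustrative only (`mm_projector.bin` comes from the `--pretrain_mm_mlp_adapter` argument in the script above; `config.json` is the usual Hugging Face model config). Adjust both to your own setup.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(model_dir: Path, required: Iterable[str]) -> List[str]:
    """Return the names in `required` that are not regular files in model_dir.

    If model_dir itself does not exist, every name is reported as missing.
    """
    return [name for name in required if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Placeholder path: replace with the directory passed to --model_name_or_path.
    model_dir = Path("llava-v1.5-7b")
    # Illustrative list, not an authoritative manifest of the checkpoint.
    required = ["config.json", "mm_projector.bin"]

    print("resolved directory:", model_dir.resolve())
    for name in missing_files(model_dir, required):
        print("missing:", name)
```

Printing the resolved directory is deliberate: if a variable such as `$CODE_DIR` is empty or points somewhere unexpected, the files can be present on disk and still not be found at the path the training script receives.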

mbzuai-oryx / GeoChat

size mismatch error when finetuning #28