mbzuai-oryx / LLaVA-pp

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

finetune error about model size #24

Closed: Skylight-Lark closed this issue 2 months ago

Skylight-Lark commented 2 months ago

Issue Title: Using the finetune script but encountering an error

Environment

I've tried various approaches from the DeepSpeed and Transformers issue trackers to fix this, but haven't been successful. Any help would be greatly appreciated!

Full Script

#!/bin/bash
deepspeed --master_port=25001 llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path ../LLaVA-Meta-Llama-3-8B-Instruct-FT-S2 \
    --version llama3 \
    --data_path ../Data/our_data.jsonl \
    --image_folder ../Data/our_data_image \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-llama-8b-stage1/mm_projector.bin \
    --gradient_checkpointing True \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-llama-8b-lora-state2 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 3 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Full StackTrace

/home/meijieru/.conda/envs/llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-14 01:55:40,956] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 295, num_elems = 8.06B
Some weights of the model checkpoint at ../LLaVA-Meta-Llama-3-8B-Instruct-FT-S2 were not used when initializing LlavaLlamaForCausalLM: ['model.vision_tower'...]

- This IS expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/home/meijieru/.conda/envs/llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()

Error(s) in loading state_dict for Sequential:
        size mismatch for 0.weight: copying a param with shape torch.Size([4096, 3072]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for 0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for 2.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for 2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
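
For context, the `torch.Size([0])` shapes on the "current model" side are what DeepSpeed ZeRO-3 partitioned parameters look like from a single rank: each rank holds only an empty placeholder, so calling `load_state_dict` on the projector with the pretrained `mm_projector.bin` weights reports a size mismatch. Below is a minimal, illustrative sketch (not the repository's fix; the `mm_projector` module and checkpoint path are stand-ins) of how weights can be copied into ZeRO-3-partitioned parameters by gathering them first:

```python
# Illustrative sketch only: loading pretrained weights into ZeRO-3-partitioned
# parameters. The projector module and checkpoint path are hypothetical here.
import deepspeed
import torch
import torch.distributed as dist

def load_projector_weights(mm_projector, ckpt_path="mm_projector.bin"):
    state_dict = torch.load(ckpt_path, map_location="cpu")
    # Gather the partitioned parameters so they have their real shapes, modify
    # them on rank 0, and let DeepSpeed broadcast and re-partition on exit.
    with deepspeed.zero.GatheredParameters(list(mm_projector.parameters()), modifier_rank=0):
        if dist.get_rank() == 0:
            mm_projector.load_state_dict(state_dict)
```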
mmaaz60 commented 2 months ago

Hi @Skylight-Lark

Thank you for your interest in our work. I noticed that another user also faced the same error; however, we are not yet sure what is causing it. Could you please try using zero3.json or zero2.json instead of zero3_offload.json and see if that solves the issue?

Thank You

Skylight-Lark commented 2 months ago

Hi @mmaaz60, thank you for your quick response. When I use zero3.json, the same problem occurs, and when using zero2.json, it runs out of memory (OOM).

mmaaz60 commented 2 months ago

Hi @Skylight-Lark

May I know which DeepSpeed version you are using? Trying version 0.13.1 may help. Further, in llava_arch.py, moving the mm_projector initialization code outside the for loop may help as well. (https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/model/llava_arch.py#L36)
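
For anyone reading along, here is a minimal sketch of the pattern being suggested. It is not the repository's actual llava_arch.py code; the wrapper class and argument names are hypothetical. The idea is to construct the mlp2x_gelu projector once, unconditionally, when the model is built, so that ZeRO-3 registers real parameters during zero.Init instead of leaving empty placeholders; the layer sizes below are taken from the size-mismatch messages in the stack trace.

```python
# Hypothetical illustration, not the repo's llava_arch.py: construct the
# mm_projector eagerly in __init__ rather than inside a per-module loop,
# so ZeRO-3 partitions real parameters instead of empty placeholders.
import torch.nn as nn

class VisionLanguageModel(nn.Module):  # hypothetical wrapper class
    def __init__(self, mm_hidden_size: int = 3072, hidden_size: int = 4096):
        super().__init__()
        # mlp2x_gelu projector; shapes match the size-mismatch messages above
        # (Linear 3072 -> 4096, GELU, Linear 4096 -> 4096).
        self.mm_projector = nn.Sequential(
            nn.Linear(mm_hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
```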

Skylight-Lark commented 2 months ago

Hi @mmaaz60

It works when moving the mm_projector initialization code outside the for loop. Thank you for your patience in solving the issue.

Chloe1997 commented 1 month ago

> Hi @mmaaz60
>
> It works when moving the mm_projector initialization code outside the for loop. Thank you for your patience in solving the issue.

Hi, I encountered the same issue while fine-tuning with LoRA. Could you please share your solution, if possible?