haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] `nan` in finetuned model weight #1278

Open lezhang7 opened 7 months ago

lezhang7 commented 7 months ago

Question

Hi,

I have successfully pretrained the mm_projector and finished the finetuning stage with the following script:

################## LLaMA-2 ##################
PROMPT_VERSION="llava_llama_2"
MODEL_VERSION="llama-2-7b-chat"
################## LLaMA-2 ##################

deepspeed --num_gpus=4 llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --version $PROMPT_VERSION \
    --data_path ~/scratch/datasets/llava/llava_instruct_80k.json \
    --image_folder ~/scratch/datasets/coco/2017/train2017 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-llama-2-7b-chat-pretrain-baseline/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-finetune \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

However, when I evaluate on the task, the output is always empty and inference becomes quite slow. Debugging step by step, I found that the weights in llava-llama-2-7b-chat-finetune/model-0000x-of-00003.safetensors seem to contain many NaNs, as shown below:

{'lm_head.weight': tensor([[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]], dtype=torch.bfloat16),
 'model.layers.23.input_layernorm.weight': tensor([nan, nan, nan,  ..., nan, nan, nan], dtype=torch.bfloat16),
 'model.layers.23.mlp.down_proj.weight': tensor([[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]], dtype=torch.bfloat16),
...
...
...

I followed the official pretraining and finetuning scripts. Any idea why this happens and how to fix it?
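
For anyone reproducing the check described above (loading each shard and looking for NaN entries), here is a minimal sketch in Python. The helper names `has_nan` and `find_nan_tensors` are hypothetical and are not part of the LLaVA codebase. It assumes tensor-like values expose an `.isnan()` method, as `torch.Tensor` does, and otherwise falls back to plain nested lists of floats.

```python
import math


def has_nan(value):
    """Return True if `value` (tensor-like or nested list of floats) contains NaN."""
    # torch.Tensor exposes .isnan(); use it when present so large
    # tensors are checked in a vectorized way instead of element by element.
    if hasattr(value, "isnan"):
        return bool(value.isnan().any())
    # Fallback for plain Python containers (handy for small tests).
    if isinstance(value, (list, tuple)):
        return any(has_nan(v) for v in value)
    return math.isnan(float(value))


def find_nan_tensors(state_dict):
    """Return the sorted names of all entries in `state_dict` that contain NaN."""
    return sorted(name for name, value in state_dict.items() if has_nan(value))


if __name__ == "__main__":
    # Toy stand-in for a checkpoint shard; the names are illustrative only.
    toy_shard = {
        "lm_head.weight": [[float("nan"), 0.1], [0.2, 0.3]],
        "model.embed_tokens.weight": [[0.5, 0.6], [0.7, 0.8]],
    }
    print(find_nan_tensors(toy_shard))
```

To run this against a real shard, one option is to load it with `safetensors.torch.load_file(path)` and pass the resulting dict to `find_nan_tensors`. Knowing which parameters are affected (all of them, or only some layers) can help narrow down where training diverged.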

pipixin321 commented 7 months ago

I ran into the same problem when running finetune_lora.sh: the loss suddenly increases during training. The only modification I made was to use half of the llava-v1_5_mix665k samples.
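
A sudden loss increase like the one described here can be flagged automatically from the logged loss values, so a run can be stopped before NaN weights are saved. The following is a rough sketch, not something taken from the LLaVA training scripts. The function name `first_unstable_step` and the defaults `spike_factor=3.0` and `window=20` are arbitrary assumptions chosen for illustration.

```python
import math


def first_unstable_step(losses, spike_factor=3.0, window=20):
    """Return the index of the first suspicious loss value, or None.

    A step is flagged when its loss is non-finite (NaN or inf), or when it
    exceeds `spike_factor` times the mean of the previous `window` losses.
    """
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
        recent = losses[max(0, step - window):step]
        if recent and loss > spike_factor * (sum(recent) / len(recent)):
            return step
    return None


if __name__ == "__main__":
    # Illustrative values only, not real training logs.
    logged = [2.0, 1.5, 1.2, 1.1, 6.0, float("nan")]
    print(first_unstable_step(logged))
```

With `--logging_steps 1`, as in the script above, every optimizer step produces a loss value, so the flagged index maps directly to a training step.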

lezhang7 commented 7 months ago

With LoRA, training works without this issue, but I still wonder why it happens with full-parameter finetuning.

sunwhw commented 3 months ago

Hi, have you solved the problem? I also hit it when finetuning videollava, which is derived from llava...

ghazalsaheb commented 3 months ago

I had the same issue and figured out it was because I was using Hugging Face's "llava-hf/llava-1.5-7b-hf" as the base model. I switched the base to "liuhaotian/llava-v1.5-7b" and that resolved the NaN issue. Plus, the training performance got much better.