haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] loss quickly drops to near zero #1750

Open · gaojianzhang opened this issue 3 weeks ago

gaojianzhang commented 3 weeks ago

Describe the issue

Issue: When fine-tuning LLaVA with LoRA on a custom dataset to build an agent for a multimodal binary-choice task, the training loss quickly drops to near zero. At test time, however, the model produces the same output regardless of the input, with very low accuracy. Interestingly, the same setup does not show this problem when training an agent for an eight-choice task. Why might this be happening? (A quick sanity check of the dataset's answer balance is sketched right after the command below.)

Command:

#!/bin/bash

# Set environment variables to disable automatic downloads
export TRANSFORMERS_OFFLINE=1
export HF_FORCE_DOWNLOAD=False
#export CUDA_VISIBLE_DEVICES=0,3
export WANDB_MODE=disabled

# Set the prompt and model versions directly in the command
deepspeed --master_port 29502 --include localhost:0 /home/jianzhang_gao/work/llava_copy/LLaVA/llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed /home/jianzhang_gao/work/LLaVA/scripts/zero3.json \
    --model_name_or_path /home/jianzhang_gao/work/LLaVA/checkpoints/llava-v1.5-7b \
    --version v1 \
    --data_path /home/jianzhang_gao/work/agent/train_data_choosezf.json \
    --image_folder /data/sdf1/jianzhang_gao/datasets \
    --vision_tower /home/jianzhang_gao/work/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /data/sdf1/jianzhang_gao/checkpoints/llava-v1.5-7b-qlora-choosezf_new \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True 
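
My first guess is that the binary dataset might be heavily imbalanced, which would let the model collapse to a constant answer and drive the loss toward zero. Below is a minimal sketch I'd use to check the answer distribution; it assumes the training JSON follows the standard LLaVA conversation format (a list of samples, each with a "conversations" list of alternating "human"/"gpt" turns) and uses the data path from the command above.

# Minimal sanity check of answer balance in the training JSON.
# Assumes the standard LLaVA conversation format; adjust the path/keys
# if your dataset differs.
import json
from collections import Counter

with open("/home/jianzhang_gao/work/agent/train_data_choosezf.json") as f:
    data = json.load(f)

answers = Counter()
for sample in data:
    for turn in sample["conversations"]:
        if turn["from"] == "gpt":
            answers[turn["value"].strip()] += 1

total = sum(answers.values())
for ans, n in answers.most_common():
    print(f"{ans!r}: {n} ({n / total:.1%})")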

Log:

{'loss': 0.963, 'grad_norm': 4.4342705227056785, 'learning_rate': 2.469135802469136e-07, 'epoch': 0.0}
{'loss': 0.888, 'grad_norm': 4.841685046830675, 'learning_rate': 4.938271604938272e-07, 'epoch': 0.0}
{'loss': 1.0173, 'grad_norm': 5.919435240014991, 'learning_rate': 7.407407407407408e-07, 'epoch': 0.0}
{'loss': 0.9541, 'grad_norm': 5.474495974051729, 'learning_rate': 9.876543209876544e-07, 'epoch': 0.0}
{'loss': 0.8798, 'grad_norm': 5.1466626246551215, 'learning_rate': 1.234567901234568e-06, 'epoch': 0.0}
{'loss': 0.9335, 'grad_norm': 4.8220511516679325, 'learning_rate': 1.4814814814814817e-06, 'epoch': 0.0}
{'loss': 0.9582, 'grad_norm': 5.115023479668671, 'learning_rate': 1.728395061728395e-06, 'epoch': 0.0}
{'loss': 0.9585, 'grad_norm': 4.462929971328918, 'learning_rate': 1.9753086419753087e-06, 'epoch': 0.0}
{'loss': 0.9771, 'grad_norm': 4.560711249915778, 'learning_rate': 2.2222222222222225e-06, 'epoch': 0.0}
{'loss': 0.9388, 'grad_norm': 4.655790025263413, 'learning_rate': 2.469135802469136e-06, 'epoch': 0.0}
{'loss': 0.9696, 'grad_norm': 5.0228492583494955, 'learning_rate': 2.7160493827160496e-06, 'epoch': 0.0}
{'loss': 0.8063, 'grad_norm': 3.83851402590431, 'learning_rate': 2.9629629629629633e-06, 'epoch': 0.0}
{'loss': 0.8025, 'grad_norm': 3.9551771346273314, 'learning_rate': 3.209876543209877e-06, 'epoch': 0.0}
{'loss': 0.7713, 'grad_norm': 3.6043135200140473, 'learning_rate': 3.45679012345679e-06, 'epoch': 0.0}
{'loss': 0.7524, 'grad_norm': 3.3629909471943984, 'learning_rate': 3.7037037037037037e-06, 'epoch': 0.0}
{'loss': 0.6855, 'grad_norm': 3.484192605038503, 'learning_rate': 3.9506172839506175e-06, 'epoch': 0.0}
{'loss': 0.5346, 'grad_norm': 2.8458294386753633, 'learning_rate': 4.197530864197531e-06, 'epoch': 0.0}
{'loss': 0.6012, 'grad_norm': 3.643244489250016, 'learning_rate': 4.444444444444445e-06, 'epoch': 0.0}
{'loss': 0.5345, 'grad_norm': 2.5836859109644585, 'learning_rate': 4.691358024691358e-06, 'epoch': 0.0}
{'loss': 0.5192, 'grad_norm': 3.4653607591195, 'learning_rate': 4.938271604938272e-06, 'epoch': 0.0}
{'loss': 0.3459, 'grad_norm': 2.0753601777883817, 'learning_rate': 5.185185185185185e-06, 'epoch': 0.0}
{'loss': 0.3861, 'grad_norm': 2.22823248488686, 'learning_rate': 5.432098765432099e-06, 'epoch': 0.0}
{'loss': 0.3347, 'grad_norm': 2.295094163340443, 'learning_rate': 5.6790123456790125e-06, 'epoch': 0.01}
{'loss': 0.3188, 'grad_norm': 2.379559165093721, 'learning_rate': 5.925925925925927e-06, 'epoch': 0.01}
{'loss': 0.2183, 'grad_norm': 1.648311720008458, 'learning_rate': 6.172839506172839e-06, 'epoch': 0.01}
{'loss': 0.206, 'grad_norm': 1.8461748103303606, 'learning_rate': 6.419753086419754e-06, 'epoch': 0.01}
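
For intuition (my own back-of-the-envelope check, not anything from the LLaVA code): with only two possible answers, a model that ignores the input and simply matches the empirical answer frequency q achieves an average answer-token cross-entropy equal to the binary entropy of q. So a loss near zero on a binary task implies either heavy label imbalance or memorization, not that the task is solved. The logged loss here also averages over any other target tokens, so treat these numbers as a rough floor:

# Rough loss floor for an input-ignoring ("constant") predictor on a
# two-way answer, evaluated on data where a fraction q of answers are A.
import math

def constant_predictor_loss(q: float) -> float:
    # Binary entropy in nats: the best average cross-entropy achievable
    # while ignoring the input entirely.
    return -(q * math.log(q) + (1 - q) * math.log(1 - q))

for q in (0.5, 0.8, 0.95, 0.99):
    print(f"q = {q:.2f}: loss floor ≈ {constant_predictor_loss(q):.3f}")

With a balanced dataset (q = 0.5) even a degenerate model cannot get below ≈ 0.693 nats on the answer token, whereas at q = 0.99 the floor is already ≈ 0.056. That would be consistent with the near-zero training loss and the constant output at test time; the eight-choice task has a much higher entropy floor, which may be why it doesn't show the same collapse.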