haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Training Lightning LLaVA-7B v1 gives a relatively higher loss curve #173

Open KosumosuL opened 1 year ago

KosumosuL commented 1 year ago

Question

Here is my training script:

  torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path llama-vicuna-7b-v1.1 \
    --version v1 \
    --data_path CC3M/chat.json \
    --image_folder CC3M/images \
    --vision_tower openai/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --bf16 True \
    --output_dir ./checkpoints/llava-lightning-7b-pretrain-v1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none

I use the llama-vicuna-7b-v1.1 weights obtained from FastChat and fine-tune on the CC595K image-text pairs with the version v1 conversation format; the loss stays around 2.5 and is hard to decrease. Yet when using llama-vicuna-7b with version v0, the loss rapidly decreases to 1.3~1.5.

Are there any bugs in the latest code? Or is this loss just normal for version v1?
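
For reference, this is a rough sketch of how I overlay the two runs' loss curves from the Trainer logs (assuming each run saved a trainer_state.json in its output_dir; the v0 output path below is just a placeholder):

    import json

    import matplotlib.pyplot as plt

    # Placeholder paths -- adjust to wherever each run wrote its trainer state.
    runs = {
        "v0": "./checkpoints/llava-lightning-7b-pretrain-v0/trainer_state.json",
        "v1": "./checkpoints/llava-lightning-7b-pretrain-v1/trainer_state.json",
    }

    for name, path in runs.items():
        with open(path) as f:
            state = json.load(f)
        # With --logging_steps 1, the Trainer records every step's loss in log_history.
        logs = [e for e in state["log_history"] if "loss" in e]
        plt.plot([e["step"] for e in logs], [e["loss"] for e in logs], label=name)

    plt.xlabel("step")
    plt.ylabel("training loss")
    plt.legend()
    plt.savefig("loss_v0_vs_v1.png")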

MrigankRaman commented 1 year ago

I am curious about this as well.

haotian-liu commented 1 year ago

Hi @KosumosuL @MrigankRaman

We are also seeing that the loss with Vicuna v1.1 is larger than with the Vicuna v0 models. Although the qualitative results look good, we are still investigating the reason. One possible cause is that in v0 prompts the end-of-sequence token "###" appears after every sentence, while in v1 the end-of-sequence token is only added after the GPT response. Furthermore, "###" can sometimes be tokenized as "#" and "##" (two tokens instead of one).

Given that "###" is really easy for the model to learn to predict, this may be why the loss of the v0 model is lower than that of v1.
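
If you want to check the tokenization yourself, a minimal sketch like the following should work (the checkpoint path is a placeholder for a local copy of the Vicuna tokenizer, and the exact splits depend on the surrounding text):

    from transformers import AutoTokenizer

    # Placeholder path -- point this at a local Vicuna/LLaMA tokenizer.
    tok = AutoTokenizer.from_pretrained("llama-vicuna-7b-v1.1", use_fast=False)

    # "###" at the start of a segment may map to a single SentencePiece piece,
    # while right after other text it can split into '#' + '##'.
    print(tok.tokenize("### Human: Hi there."))
    print(tok.tokenize("Hello.### Assistant: Hi there."))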

If you have any better explanations or insights, please let me know. I'll post updates here if there are more findings as well.

Thanks!