haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
20.15k stars 2.22k forks

[Question] Overfitting in my finetune experiment using my custom data #847

Open Pro-flynn opened 11 months ago

Pro-flynn commented 11 months ago

Question

After finetuning on my custom data, the finetuned LLaVA model is overfitting. In my experiments, I followed your instructions (cited in https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md):

  1. Convert my data to the required format, as follows (a conversion sketch for this step is shown after this list):

    {
        "id": "mamian_fengwo_000252",
        "image": "mamian_fengwo_000252.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhere is honeycombing on pillars in the image? answer in [[x0,y0,x1,y1]] format."
            },
            {
                "from": "gpt",
                "value": "[[0.7677605, 0.815028, 0.8906875, 0.92288], [0, 0.675963, 0.03476, 0.890241], [0.664312, 0.7921855, 0.7664839999999999, 0.9241485], [0.1377295, 0.7824074999999999, 0.2766145, 0.9952505]]"
            }
        ]
    },
  2. Use the official script (cited in https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task_lora.sh), as follows:

    deepspeed llava/train/train_mem.py \                                                                                                               
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \                                                                      
    --deepspeed ./scripts/zero3.json \                                                                                                             
    --model_name_or_path liuhaotian/llava-v1.5-13b \                                                                                               
    --version v1 \                                                                                                                                 
    --data_path ./playground/data/llava_lora_finetune_mamianfengwo_floatanno_xywh_train.json \                                                     
    --image_folder ./playground/data/images \                                                                                                      
    --vision_tower openai/clip-vit-large-patch14-336 \                                                                                             
    --mm_projector_type mlp2x_gelu \                                                                                                               
    --mm_vision_select_layer -2 \                                                                                                                  
    --mm_use_im_start_end False \                                                                                                                  
    --mm_use_im_patch_token False \                                                                                                                
    --image_aspect_ratio pad \                                                                                                                     
    --group_by_modality_length True \                                                                                                              
    --bf16 True \                                                                                                                                  
    --output_dir ./checkpoints/llava-v1.5-13b-task-lora_100epoch_floatanno_xywh_train.json \                                                       
    --num_train_epochs 100 \                                                                                                                       
    --per_device_train_batch_size 16 \                                                                                                             
    --per_device_eval_batch_size 4 \                                                                                                               
    --gradient_accumulation_steps 1 \                                                                                                              
    --evaluation_strategy "no" \                                                                                                                   
    --save_strategy "steps" \                                                                                                                      
    --save_steps 50000 \                                                                                                                           
    --save_total_limit 10 \
    --learning_rate 2e-4 \                                                                                                                         
    --weight_decay 0. \                                                                                                                            
    --warmup_ratio 0.03 \                                                                                                                          
    --lr_scheduler_type "cosine" \                                                                                                                 
    --logging_steps 1 \                                                                                                                            
    --tf32 True \                                                                                                                                  
    --model_max_length 2048 \                                                                                                                      
    --gradient_checkpointing True \                                                                                                                
    --dataloader_num_workers 4 \                                                                                                                   
    --lazy_preprocess True

    We found the finetuned LLaVA model was underfitting when setting the epochs to 1-10, so we set the epochs to 50-100; however, the finetuned model is now overfitting.

  3. We find that the train loss = 0 when training ends, and the performance on the test data is very poor.
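
For reference, a minimal conversion sketch for step 1. The raw annotation schema (a JSON list with file_name/width/height/boxes fields), the question string, and the file paths are assumptions for illustration, not the actual files used here; adapt them to your own data:

```python
# Hypothetical conversion: raw detection annotations -> LLaVA conversation format.
import json

def to_llava_entry(sample, question):
    w, h = sample["width"], sample["height"]
    # Normalize pixel boxes to [0, 1]; round to ~3 decimals (see the precision note later in this thread).
    boxes = [
        [round(x0 / w, 3), round(y0 / h, 3), round(x1 / w, 3), round(y1 / h, 3)]
        for x0, y0, x1, y1 in sample["boxes"]
    ]
    return {
        "id": sample["file_name"].rsplit(".", 1)[0],
        "image": sample["file_name"],
        "conversations": [
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": json.dumps(boxes)},
        ],
    }

if __name__ == "__main__":
    # Hypothetical input: [{"file_name": ..., "width": ..., "height": ..., "boxes": [[x0, y0, x1, y1], ...]}, ...]
    with open("raw_annotations.json") as f:
        raw = json.load(f)
    question = "Where is honeycombing on pillars in the image? answer in [[x0,y0,x1,y1]] format."
    entries = [to_llava_entry(s, question) for s in raw]
    with open("llava_custom_train.json", "w") as f:
        json.dump(entries, f, indent=4)
```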

Pro-flynn commented 11 months ago

How do you think I should adjust my training strategy?

Linziyang1999 commented 11 months ago

I think the epoch num is tooooo big. My model is also a little bit overfitting after a full finetune on 20k data and 2 epochs (batch 4), and the loss was 0.67.

FHL1998 commented 11 months ago

What's the current inference performance? Do you think LLaVA is suitable for this kind of object detection task?

Linziyang1999 commented 11 months ago

Maybe you can check OCR LLaVA, someone already did it. And they use an OCR dataset both in pretraining and finetuning.

Linziyang1999 commented 11 months ago

LLMs have shown outstanding performance on OCR. I think LLaVA can make it.

Linziyang1999 commented 11 months ago

https://llavar.github.io/ Check this

Nomiluks commented 11 months ago

I've also adopted a similar approach for training my model. However, I find myself perplexed upon reviewing the training statistics.

wandb: Run history:
wandb:                    train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:              train/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:            train/learning_rate ▄███████▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:                     train/loss █▆▇▆▆▆▆▅▆▅▄▄▄▃▂▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:               train/total_flos ▁
wandb:               train/train_loss ▁
wandb:            train/train_runtime ▁
wandb: train/train_samples_per_second ▁
wandb:   train/train_steps_per_second ▁
wandb: 
wandb: Run summary:
wandb:                    train/epoch 20.0
wandb:              train/global_step 360
wandb:            train/learning_rate 0.0
wandb:                     train/loss 0.0072
wandb:               train/total_flos 1967070044160.0
wandb:               train/train_loss 0.4161
wandb:            train/train_runtime 848.715
wandb: train/train_samples_per_second 3.323
wandb:   train/train_steps_per_second 0.424

I'm puzzled about the distinction between train/train_loss with a value of 0.4161 and train/loss with a value of 0.0072. Could someone please clarify this for me?

Nomiluks commented 11 months ago

Also, I have noticed the same issue. The results on the unseen dataset are really bad.

aneet-javis commented 11 months ago

Could anyone tell me what hardware you guys are finetuning on? I tried on one A10G with batch_per_device=1, but I'm getting an OOM error.

Nomiluks commented 11 months ago

After trying a couple of different machines, I used an A100 GCP instance and it worked like a charm.

haotian-liu commented 11 months ago

You can try lowering the number of epochs. Check out the example here: I finetuned for 3 epochs with batch size 8 on 100 GPT-4V-captioned anime examples, and it already works great: https://github.com/haotian-liu/LLaVA/issues/766#issuecomment-1800214174. You can also take a look at the wandb logs; the training loss should not be too low, as that would indicate overfitting. Additionally, fusing a few samples from LLaVA-Instruct or the llava-v1.5 data mixture may also help reduce the overfitting.

@Nomiluks one of them is probably the end-of-epoch stats (there will be just one number for a single experiment), and the other may be the last-iteration stats (one number for each iteration, but only the last iteration is displayed); looking at the wandb interface may help you better understand the stats.
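
A minimal sketch of the data-fusing suggestion above, assuming the llava_v1_5_mix665k.json mixture file referenced in the repo's finetuning scripts and a hypothetical custom training JSON; the paths and the sample count are placeholders:

```python
# Mix a few llava-v1.5 mixture samples into a custom finetuning set to reduce overfitting.
import json
import random

random.seed(0)

with open("./playground/data/llava_v1_5_mix665k.json") as f:
    mixture = json.load(f)
with open("./playground/data/my_custom_train.json") as f:   # hypothetical custom dataset
    custom = json.load(f)

# Note: mixture entries reference COCO/GQA/etc. images; only sample entries whose images
# you actually have locally, or restrict to text-only entries (no "image" key).
extra = random.sample(mixture, k=min(2000, len(mixture)))

merged = custom + extra
random.shuffle(merged)

with open("./playground/data/my_custom_train_mixed.json", "w") as f:
    json.dump(merged, f, indent=2)
```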

Nomiluks commented 11 months ago

You can try lowering the number of epochs. Check out the example here: I finetuned for 3 epochs with batch size 8 on 100 GPT-4V-captioned anime examples, and it already works great: #766 (comment). You can also take a look at the wandb logs; the training loss should not be too low, as that would indicate overfitting. Additionally, fusing a few samples from LLaVA-Instruct or the llava-v1.5 data mixture may also help reduce the overfitting.

@Nomiluks one of them is probably the end-of-epoch stats (there will be just one number for a single experiment), and the other may be the last-iteration stats (one number for each iteration, but only the last iteration is displayed); looking at the wandb interface may help you better understand the stats.

Thanks for your response @haotian-liu

I'm working on implementing LLaVA to identify pixel-based image forgery or tampering in my dataset. I currently have 100 samples, and I'm considering LoRA-based fine-tuning as suggested in the documentation. Do you believe this sample size is sufficient for effective fine-tuning? Additionally, I'm open to any advice or best practices for training LLaVA to specifically detect image forgery. Your insights would be greatly appreciated!

Training Example:

    {
        "id": "tampered_654c8796140dc970e0d179d5-back",
        "image": "tampered_654c8796140dc970e0d179d5-back.jpeg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nAnalyze the provided document image with the objective of detecting potential instances of image forgery resulting from digital tampering or manipulation. Identify all manipulated regions and present the results in the following format: [[x0, y0, x1, y1]]. If no tampered regions are identified, please return [[]]."
            },
            {
                "from": "gpt",
                "value": "[[0.578, 0.604, 0.938, 0.99]]"
            }
        ]
    }
FHL1998 commented 11 months ago

I wonder, if LLaVA faces a brand new domain, should we do something like fine-tuning the visual encoder as a first step, since right now the vision encoder is not tuned?

FHL1998 commented 11 months ago

@Nomiluks According to my experiment, a dataset size of 100 may easily cause overfitting. I tried to enlarge my dataset to 8000 entries (containing a few LLaVA instruction samples). However, the result shows decreasing performance; it cannot even interpret the "man behind the taxi" example. I am still figuring out the cause.

ronnymunthe99 commented 11 months ago

@Nomiluks According to my experiment, a dataset size of 100 may easily cause overfitting. I tried to enlarge my dataset to 8000 entries (containing a few LLaVA instruction samples). However, the result shows decreasing performance; it cannot even interpret the "man behind the taxi" example. I am still figuring out the cause.

Yes, I am also having the same problem; have you found out the cause?

Nomiluks commented 11 months ago

I think the epoch num is tooooo big. My model is also a little bit overfitting after a full finetune on 20k data and 2 epochs (batch 4), and the loss was 0.67.

Is the 0.67 the overall loss you're referring to? It seems a bit high; typically, we aim for a loss close to 0 for a well-fit model. This value might suggest that the model is underfitting. Could you provide more context or details about the training process? It's important to assess whether this level of loss is acceptable for your specific use case.

Nomiluks commented 11 months ago

@Nomiluks According to my experiment, a dataset size of 100 may easily cause overfitting. I tried to enlarge my dataset to 8000 entries (containing a few LLaVA instruction samples). However, the result shows decreasing performance; it cannot even interpret the "man behind the taxi" example. I am still figuring out the cause.

Yeah, it seems it is unable to learn; the model either overfits or underfits.

haotian-liu commented 11 months ago

I am wondering how big the domain shift is? For example, for the extremely detailed anime captioning, I was actually surprised by what it can do with 100 examples: https://github.com/haotian-liu/LLaVA/issues/766#issuecomment-1800214174

FHL1998 commented 11 months ago

I am wondering how big the domain shift is? For example, for the extremely detailed anime captioning, I was actually surprised by what it can do with 100 examples: #766 (comment)

@haotian-liu Here are two examples from my side, and the loss curve over 3 epochs:

[screenshots: two examples and the loss curve]

haotian-liu commented 11 months ago

The loss curve is very concerning here. Here is one of the LoRA finetuning loss curves on stable diffusion prompts.

[loss curve screenshot]

The initial spike suggests that there is something wrong.

haotian-liu commented 11 months ago

@Pro-xiaowen

Btw, just noticed this: [0.7677605, 0.815028, 0.8906875, 0.92288]. These coordinates seem overly precise; you may just need three digits. The extra digits may just cause the model to hallucinate.
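
A minimal sketch of that rounding step, assuming entries in the conversation format shown at the top of this issue; the file paths are placeholders:

```python
# Truncate box coordinates in the "gpt" answers to ~3 decimals.
import ast
import json

def round_boxes(entry, ndigits=3):
    for turn in entry["conversations"]:
        value = turn["value"]
        if turn["from"] != "gpt" or not value.startswith("[["):
            continue
        boxes = ast.literal_eval(value)  # "[[0.7677605, 0.815028, ...]]" -> nested list
        turn["value"] = json.dumps([[round(v, ndigits) for v in box] for box in boxes])
    return entry

with open("llava_custom_train.json") as f:             # hypothetical path
    data = json.load(f)
data = [round_boxes(e) for e in data]
with open("llava_custom_train_rounded.json", "w") as f:
    json.dump(data, f, indent=4)
```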

FHL1998 commented 11 months ago

The loss curve is very concerning here. Here is one of the LoRA finetuning loss curves on stable diffusion prompts.

[loss curve screenshot]

The initial spike suggests that there is something wrong.

@haotian-liu Thanks for your reply! May I ask how many samples are included in the dataset, I mean the extra LLaVA instruction samples and the total number of samples?

Linziyang1999 commented 11 months ago

I think the epoch num is tooooo big. My model is also a little bit overfitting after a full finetune on 20k data and 2 epochs (batch 4), and the loss was 0.67.

Is the 0.67 the overall loss you're referring to? It seems a bit high; typically, we aim for a loss close to 0 for a well-fit model. This value might suggest that the model is underfitting. Could you provide more context or details about the training process? It's important to assess whether this level of loss is acceptable for your specific use case.

The LLM generates more words beyond your answer; it does not mean the answer is wrong. From my experiments, a loss between 0.6~0.8 is normal. If you want the model to be more accurate, you may focus on increasing the size of the dataset. Here is my loss, and the model works well. ^_^ Hope it can help you.

[Screenshot 2023-11-29 11:24:53: loss curve]
FHL1998 commented 11 months ago

The LLM generates more words beyond your answer; it does not mean the answer is wrong. From my experiments, a loss between 0.6~0.8 is normal. If you want the model to be more accurate, you may focus on increasing the size of the dataset. Here is my loss, and the model works well. ^_^ Hope it can help you. [Screenshot 2023-11-29 11:24:53: loss curve]

@Linziyang1999 May I ask the number of samples included in your dataset (How many customized samples and original LLaVA samples)?

Linziyang1999 commented 11 months ago

The LLM generates more words beyond your answer; it does not mean the answer is wrong. From my experiments, a loss between 0.6~0.8 is normal. If you want the model to be more accurate, you may focus on increasing the size of the dataset. Here is my loss, and the model works well. ^_^ Hope it can help you. [Screenshot 2023-11-29 11:24:53: loss curve]

@Linziyang1999 May I ask the number of samples included in your dataset (How many customized samples and original LLaVA samples)?

My custom samples total 20k, and I found that an error is raised during training if the dataset only has image conversations, so I added a few conversations from mix665k without images (10 maybe? just enough to make it work).
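
A short sketch of this workaround, assuming the mix665k JSON and a hypothetical custom file; text-only entries are simply those without an "image" key, and the count of 10 is a placeholder:

```python
# Append a handful of text-only conversations so the dataset is not purely image data.
import json
import random

with open("./playground/data/llava_v1_5_mix665k.json") as f:
    mixture = json.load(f)
with open("my_custom_train.json") as f:               # hypothetical custom dataset
    custom = json.load(f)

text_only = [e for e in mixture if "image" not in e]  # e.g. ShareGPT-style conversations
custom += random.sample(text_only, k=10)
random.shuffle(custom)

with open("my_custom_train_with_text.json", "w") as f:
    json.dump(custom, f, indent=2)
```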

FHL1998 commented 11 months ago

@haotian-liu In my case, the loss seems to drop very quickly after only 30 steps. I have checked three things:

  1. I have already enlarged my dataset to 20k samples by mixing my customized dataset with LLaVA instruction samples;
  2. I have checked the dataset format (id, image, etc.) (a quick sanity-check sketch is included after the script below);
  3. Everything went well during the finetuning phase (no errors, warnings, or size mismatches).

[loss curve screenshot]

Is there any obvious error in my fine-tuning script, or does anyone have an idea about what happened? BTW, I used 4x A100 (80GB).

deepspeed llava/train/train_mem.py \                                                                                                               
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \                                                                      
    --deepspeed ./scripts/zero3.json \                                                                                                             
    --model_name_or_path liuhaotian/llava-v1.5-13b \                                                                                               
    --version v1 \                                                                                                                                 
    --data_path dataset_finetune/llava_finetune_task_v2.json \                                                     
    --image_folder ./playground/data/images \                                                                                                      
    --vision_tower openai/clip-vit-large-patch14-336 \                                                                                             
    --mm_projector_type mlp2x_gelu \                                                                                                               
    --mm_vision_select_layer -2 \                                                                                                                  
    --mm_use_im_start_end False \                                                                                                                  
    --mm_use_im_patch_token False \                                                                                                                
    --image_aspect_ratio pad \                                                                                                                     
    --group_by_modality_length True \                                                                                                              
    --bf16 True \                                                                                                                                  
    --output_dir ./checkpoints/llava-v1.5-13b-task-lora-v2 \                                                       
    --num_train_epochs 3 \                                                                                                                       
    --per_device_train_batch_size 8 \                                                                                                             
    --per_device_eval_batch_size 2 \                                                                                                               
    --gradient_accumulation_steps 4 \                                                                                                              
    --evaluation_strategy "no" \                                                                                                                   
    --save_strategy "steps" \                                                                                                                      
    --save_steps 50000 \                                                                                                                           
    --save_total_limit 10 \
    --learning_rate 2e-4 \                                                                                                                         
    --weight_decay 0. \                                                                                                                            
    --warmup_ratio 0.03 \                                                                                                                          
    --lr_scheduler_type "cosine" \                                                                                                                 
    --logging_steps 1 \                                                                                                                            
    --tf32 True \                                                                                                                                  
    --model_max_length 2048 \                                                                                                                      
    --gradient_checkpointing True \                                                                                                                
    --dataloader_num_workers 2 \                                                                                                                   
    --lazy_preprocess True
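
A quick sanity-check sketch for point 2 above (dataset format). The checks only cover the basics from the custom-data docs, and the paths are taken from the script above; adjust as needed:

```python
# Verify that every entry has the expected keys, that image entries contain the <image>
# token in a human turn, and that the referenced image files exist on disk.
import json
import os

DATA_PATH = "dataset_finetune/llava_finetune_task_v2.json"
IMAGE_FOLDER = "./playground/data/images"

with open(DATA_PATH) as f:
    data = json.load(f)

for i, entry in enumerate(data):
    assert "id" in entry and "conversations" in entry, f"entry {i}: missing keys"
    convs = entry["conversations"]
    assert convs and convs[0]["from"] == "human", f"entry {i}: must start with a human turn"
    if "image" in entry:
        assert any("<image>" in t["value"] for t in convs if t["from"] == "human"), \
            f"entry {i}: no <image> token in any human turn"
        path = os.path.join(IMAGE_FOLDER, entry["image"])
        assert os.path.exists(path), f"entry {i}: missing image file {path}"

print(f"Checked {len(data)} entries.")
```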
Pro-flynn commented 11 months ago

I think the epoch num is tooooo big. My model is also a little bit overfitting after a full finetune on 20k data and 2 epochs (batch 4), and the loss was 0.67.

We found the finetuned LLaVA model was underfitting when setting the epochs to 1-10; even the predictions on the train data are wrong! @Linziyang1999

CrazyBrick commented 11 months ago

We found the finetuned LLaVA model was underfitting when setting the epochs to 1-10; even the predictions on the train data are wrong!

So will num_train_epochs 100 make your loss smaller and the predictions more accurate? (I also encountered trouble; my fine-tuning didn't work.) @Pro-xiaowen

rohitpanjwani03 commented 11 months ago

Hi guys, I am on Colab, running it on an A100 and trying to fine-tune using the code below, but I'm facing errors related to ./checkpoints, train_mem.py, and other train.py files.

My code:

```
!git clone https://github.com/haotian-liu/LLaVA.git
%cd /content/LLaVA
!pip install -q gradio .

!bash /content/LLaVA/scripts/v1_5/finetune.sh
```

Can you guys help me with the correct way to fine-tune it?
ninjacode01 commented 9 months ago

Gibberish output (even on train data) with a weird loss curve on full finetuning. Can someone please help me fix this?

I am trying to fully finetune the text-only model Vicuna-v1.5 using my custom QnA data comprising 160k QA pairs, using the same finetuning script as provided in finetune_task.sh but omitting the multimodal parameters. Here is the loss curve at 2.4 epochs. wandb report: [loss curve screenshot]

zengxingchen commented 8 months ago

Gibberish output (even on train data) with a weird loss curve on full finetuning. Can someone please help me fix this?

I am trying to fully finetune the text-only model Vicuna-v1.5 using my custom QnA data comprising 160k QA pairs, using the same finetuning script as provided in finetune_task.sh but omitting the multimodal parameters. Here is the loss curve at 2.4 epochs. wandb report: [loss curve screenshot]

My loss curve trend is in line with yours, and the fine-tuning turned out poorly. Sad.

Ravi-Teja-konda commented 8 months ago

Hi guys, I am on Colab, running it on an A100 and trying to fine-tune using the code below, but I'm facing errors related to ./checkpoints, train_mem.py, and other train.py files.

My code:

```
!git clone https://github.com/haotian-liu/LLaVA.git
%cd /content/LLaVA
!pip install -q gradio .

!bash /content/LLaVA/scripts/v1_5/finetune.sh
```

Can you guys help me with the correct way to fine-tune it?

Hello @rohitpanjwani03,

Were you able to fine-tune it? I'm also trying to fine-tune. Was anything missing in your finetuning process?

I'm using replicate to fine tune the model https://replicate.com/ravi-teja-konda/llava_finetune/versions/58ea2fa644ef90a63c50bc608a532e2acd5792208978760164f3db900247f062

But as Replicate currently does not support changing the hyperparameters, it looks like I need to finetune on my own by running it in Colab. Or do we have any alternatives, like on Hugging Face, and has anyone tried them?

ggcr commented 5 months ago

Just in case someone is having problems during inference: if you have a script that uses /llava/eval/run_llava.py as a baseline for inference, you should be careful with the args. In my case I noticed that the run_llava.py file will merge the LoRA weights if you specify a model-base, and mine were already merged, hence the poor performance at inference.

If the weights are already merged, do not specify a model-base (otherwise they will be merged again).

If the weights are not merged, you should also specify a model-base so they get merged at load time.
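
For reference, a minimal sketch of the two cases. The checkpoint paths, image, and query are placeholders, and the flag names follow my reading of llava/eval/run_llava.py's argparse, so double-check them against your checkout:

```python
# Hedged sketch: invoking llava/eval/run_llava.py for merged vs. unmerged LoRA weights.
import subprocess

MERGED_CKPT = "./checkpoints/llava-v1.5-13b-task-merged"  # LoRA already merged into the base
LORA_CKPT = "./checkpoints/llava-v1.5-13b-task-lora"      # raw LoRA adapter checkpoint
BASE = "liuhaotian/llava-v1.5-13b"

common = ["--image-file", "test.jpg", "--query", "Describe the image."]

# Case 1: weights already merged -> do NOT pass --model-base, or they get merged a second time.
subprocess.run(
    ["python", "-m", "llava.eval.run_llava", "--model-path", MERGED_CKPT] + common,
    check=True,
)

# Case 2: raw LoRA adapter -> pass --model-base so the adapter is merged onto it at load time.
subprocess.run(
    ["python", "-m", "llava.eval.run_llava", "--model-path", LORA_CKPT, "--model-base", BASE] + common,
    check=True,
)
```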