haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Fine-Tuning LLaVA v1.5-7B lora on Custom Dataset and RuntimeError in Model Evaluation #1030

Open rorubyy opened 9 months ago

rorubyy commented 9 months ago

Question

Hello LLaVA Team,

I've been fine-tuning the LLaVA v1.5-7B model with LoRA on a custom dataset using the provided finetune_task_lora.sh script (bash scripts/v1_5/finetune_task_lora.sh). Here is the configuration I used:

    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path liuhaotian/llava-v1.5-7b \
    --version v1 \
    --data_path /workspace/Dataset/train.json \
    --image_folder ./ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-task-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

After training, these were the results: {'train_runtime': 25078.2556, 'train_samples_per_second': 1.595, 'train_steps_per_second': 0.1, 'train_loss': 0.16062020410320182, 'epoch': 1.0}
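
For reference, here is a minimal sketch (my reading of the flags, based on the standard peft wiring; the target modules listed are illustrative, since LLaVA selects the linear layers itself) of what --lora_r 128 and --lora_alpha 256 amount to:

    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=128,              # --lora_r
        lora_alpha=256,     # --lora_alpha; effective scaling = alpha / r = 2.0
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
        lora_dropout=0.05,  # assumed script default
        task_type="CAUSAL_LM",
    )
    # model = get_peft_model(model, lora_config)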

When attempting to evaluate the model using model_vqa.py, I encountered a runtime error. The model loads correctly, but during evaluation I receive a RuntimeError: probability tensor contains either `inf`, `nan` or element < 0.

    python llava/eval/model_vqa.py \
        --model-path checkpoints/llava-v1.5-7b-task-lora/ \
        --model-base checkpoints/llava-v1.5-7b/ \
        --question-file Dataset/eval_ques.jsonl \
        --image-folder ./ \
        --answers-file /workspace/Dataset/eval_answer.jsonl

Here's the traceback:

Loading LLaVA from base model...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.14s/it]
Loading additional LLaVA weights...
Loading LoRA weights...
Merging LoRA weights...
Model is loaded...
  0%|                                                                                                                                                                  | 0/2108 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/llava/eval/model_vqa.py", line 125, in <module>
    eval_model(args)
  File "/workspace/llava/eval/model_vqa.py", line 66, in eval_model
    output_ids = model.generate(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
    return self.sample(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2678, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
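
The failure happens in the sampling step: torch.multinomial rejects a probability tensor containing NaN, which points at non-finite logits coming out of the forward pass. A quick check (a sketch; model, input_ids, and images are assumed to be prepared as in model_vqa.py) can confirm whether the logits are already non-finite before sampling:

    import torch

    def logits_are_finite(model, input_ids, images):
        # One forward pass; True only if the logits contain no inf/NaN.
        with torch.inference_mode():
            out = model(input_ids=input_ids, images=images)
        return bool(torch.isfinite(out.logits).all())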

It seems that the model's hidden-state outputs are all NaN:

BaseModelOutputWithPast(last_hidden_state=tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       dtype=torch.float16), past_key_values=((tensor([[[[-1.7842,  0.6445,  0.9375,  ..., -1.8057,  1.9531, -2.0898],
          [-0.0104, -0.5850,  0.0938,  ...,  0.6738, -0.7227,  0.6948],
          [ 1.1318,  3.6055, -0.5405,  ..., -2.7910,  4.2617, -2.5684],
          ...,
          [ 0.8984,  3.0488,  0.4187,  ..., -2.5312,  3.5684, -2.3633],
          [ 0.6392,  1.5225, -0.4216,  ..., -0.7524,  1.3262, -0.6392],
          [-0.0755,  0.3811,  0.0687,  ...,  0.4375, -0.8335,  0.4001]],

         [[-0.4683,  1.1523,  0.1193,  ..., -0.9561,  1.3672, -0.9609],
          [ 0.0293,  0.4148, -0.1000,  ..., -0.3264, -0.1302, -0.1819],
          [ 2.3184, -2.0352,  1.4316,  ..., -1.8232,  2.5410, -1.8711],
          ...,
          [ 1.9541, -1.8486, -1.5303,  ..., -1.3906,  2.0664, -1.4053],
          [ 0.7222, -0.9507, -0.5615,  ..., -0.4160,  0.9175, -0.4568],
          [ 0.4136, -1.2432,  0.7637,  ...,  1.0225, -0.6924,  1.0254]],

         [[-0.9062, -2.0078, -0.9204,  ..., -0.7695, -0.4683, -0.3967],
          [-0.0662,  0.4402,  0.1777,  ...,  1.8486,  1.9346,  1.8145],
          [ 1.2881,  0.0416,  0.6675,  ..., -2.3965, -2.6621, -2.6934],
          ...,
          [ 0.1807, -1.1299, -0.6143,  ..., -2.4707, -2.8301, -2.9219],
          [-1.0068,  0.0765, -0.5195,  ..., -1.5000, -1.7148, -1.7324],
          [ 0.2966, -0.6782,  0.0466,  ...,  1.0801,  1.2441,  1.1377]],

         ...,

Could you help me understand what might be causing this issue and how to resolve it? Thank you very much.
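
To narrow this down, a forward-hook sweep (a sketch; the helper name is mine) can locate the first module whose output goes NaN, which helps distinguish a bad LoRA merge near the projector from, e.g., an fp16 overflow deeper in the LLM:

    import torch

    def find_first_nan(model):
        # Register forward hooks that report the first module emitting NaN.
        handles = []

        def make_hook(name):
            def hook(module, args, output):
                out = output[0] if isinstance(output, tuple) else output
                if torch.is_tensor(out) and torch.isnan(out).any():
                    print(f"first NaN in output of: {name}")
                    for h in handles:  # stop reporting after the first hit
                        h.remove()
            return hook

        for name, module in model.named_modules():
            handles.append(module.register_forward_hook(make_hook(name)))
        return handles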

cherry956 commented 9 months ago

@rorubyy I ran the same scripts as you, but I encountered new issues. [Screenshot 2024-02-12 112205] How should I fix the max model sequence length, and the problem of uploading many images? Here is my test JSON: [Screenshot 2024-02-12 112631]
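
A rough token-budget check (my arithmetic, assuming the configuration from the original post: a 336 px CLIP tower with 14 px patches and --model_max_length 2048) shows why several images overflow the sequence length:

    # Each image expands to a fixed number of patch tokens.
    image_size, patch_size, max_len = 336, 14, 2048
    tokens_per_image = (image_size // patch_size) ** 2  # 24 * 24 = 576
    print(tokens_per_image)             # 576
    print(max_len // tokens_per_image)  # 3 -> at most ~3 images, before any text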

monjurulkarim commented 8 months ago

@rorubyy What size is your custom dataset? I'm curious about its performance with smaller datasets.

chanangad commented 6 months ago

hi @rorubyy

Were you able to figure out why the hidden states are NaN? I'm facing the same issue.

shreyanshu09 commented 6 months ago

Hello @rorubyy @chanangad

I am also facing the same issue. Does anyone have a solution or any ideas on how to fix it?

wentaoyuan commented 5 months ago

I encountered the same issue while running model_vqa.py with a fine-tuned 7B model.

ghazalsaheb commented 3 months ago

I used to have the same issue, and I figured out it was because I was using Hugging Face's "llava-hf/llava-1.5-7b-hf" as the base model. Switching the base to "liuhaotian/llava-v1.5-7b" resolved the NaN issue, and training performance got much better as well.
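
For anyone hitting the same thing, here is a minimal loading sketch (paths are placeholders; this mirrors what model_vqa.py does internally) using the base that worked for me:

    from llava.mm_utils import get_model_name_from_path
    from llava.model.builder import load_pretrained_model

    model_path = "checkpoints/llava-v1.5-7b-task-lora"
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base="liuhaotian/llava-v1.5-7b",  # not llava-hf/llava-1.5-7b-hf
        # A model name containing "lora" triggers the LoRA merge path.
        model_name=get_model_name_from_path(model_path),
    )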