haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] Inference Speed Issue with LoRA Fine-tuned Model on ScienceQA #1763

Open jinghanSunn opened 1 week ago

jinghanSunn commented 1 week ago

Hi Haotian,

Thank you for your incredible work on this project.

I am encountering an issue during inference. When I use the non-LoRA weights for inference on ScienceQA, the speed is approximately 1 second per sample. However, when I switch to the LoRA fine-tuned model, the inference time increases drastically to over 40 seconds per sample.
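
In case it helps narrow this down: one way to rule out any per-load adapter handling would be to merge the LoRA weights into the base model once, offline, and then run the eval script on the merged checkpoint without --model-base. A rough sketch, mirroring what scripts/merge_lora_weights.py in this repo appears to do (the paths match my commands below; the output directory name is just made up):

import os
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

lora_path = "./LLaVA-v1.5-7b-lora/checkpoint-50000/"   # same as --model-path below
base_path = "./LLAVA-1.5/llava-v1.5-7b/"               # same as --model-base below
merged_path = "./LLaVA-v1.5-7b-lora-merged/"           # hypothetical output directory

# As far as I can tell, load_pretrained_model applies the adapter on top of the base
# weights and merges it when the model name contains "lora" and a base model is given,
# so the returned model should hold plain merged weights.
model_name = get_model_name_from_path(lora_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(lora_path, base_path, model_name)

os.makedirs(merged_path, exist_ok=True)
model.save_pretrained(merged_path)
tokenizer.save_pretrained(merged_path)

The merged directory could then be passed as --model-path with no --model-base, which should make the inference code path identical to the non-LoRA case except for the weights themselves.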

Here is the command I am using for fine-tuning (training was run on a single V100, with lora_r=4, bf16=False, and tf32=False):

CUDA_VISIBLE_DEVICES=1 python3 llava/train/train.py \
    --lora_enable True --lora_r 4 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --model_name_or_path ./LLAVA-1.5/llava-v1.5-7b/ \
    --version v1 \
    --data_path ./playground/data/eval/scienceqa/llava_train_CQM-A.json \
    --image_folder ./data/ScienceQA/image/train/ \
    --vision_tower ./data/clip-vit-large-patch14-336/ \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --output_dir ./LLaVA-v1.5-7b-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Here is the command I am using for inference:

CUDA_VISIBLE_DEVICES=3 python3 -m llava.eval.model_vqa_science \
    --model-path ./LLaVA-v1.5-7b-lora/checkpoint-50000/ \
    --model-base ./LLAVA-1.5/llava-v1.5-7b/ \
    --question-file ./playground/data/eval/scienceqa/llava_test_CQM-A.json \
    --image-folder ./data/ScienceQA/image/test/ \
    --answers-file ./playground/data/eval/scienceqa/answers/llava-v1.5-7b-lora-50000.jsonl \
    --single-pred-prompt \
    --temperature 0 \
    --conv-mode vicuna_v1
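
To try to localize where the 40 seconds per sample go, I can also time raw decoding outside the eval script. A minimal sketch, assuming text-only generation works for this model class and reusing the paths from the commands above:

import time
import torch
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

lora_path = "./LLaVA-v1.5-7b-lora/checkpoint-50000/"
base_path = "./LLAVA-1.5/llava-v1.5-7b/"
tokenizer, model, image_processor, _ = load_pretrained_model(
    lora_path, base_path, get_model_name_from_path(lora_path))
model.eval()

prompt = ("A chat between a curious human and an artificial intelligence assistant. "
          "USER: What is 2 + 2? ASSISTANT:")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

torch.cuda.synchronize()
start = time.time()
with torch.inference_mode():
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=64, use_cache=True)
torch.cuda.synchronize()
elapsed = time.time() - start

# Depending on the LLaVA version, generate() may or may not echo the prompt tokens back.
prompt_len = input_ids.shape[1]
echoed = output_ids.shape[1] >= prompt_len and torch.equal(output_ids[0, :prompt_len], input_ids[0])
new_tokens = output_ids.shape[1] - (prompt_len if echoed else 0)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
print(f"{new_tokens} new tokens in {elapsed:.2f}s ({elapsed / max(new_tokens, 1):.3f} s/token)")

Comparing the per-token time (and whether the model keeps generating up to max_new_tokens instead of stopping) between the base weights and the LoRA checkpoint should show whether the slowdown is per token or simply due to much longer outputs.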

Could you please help me understand why there is such a significant difference in inference speed between the two models?

Thank you!

Screenshot: "Screenshot 2024-11-12 144626" (image attached)

adapter_config.json:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "./data/LLAVA-1.5/llava-v1.5-7b/",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 256,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 4,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "down_proj",
    "o_proj",
    "q_proj",
    "gate_proj",
    "up_proj",
    "v_proj",
    "k_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false

config.json

{
  "_name_or_path": "./data/LLAVA-1.5/llava-v1.5-7b/",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "freeze_mm_vision_resampler": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "mm_hidden_size": 1024,
  "mm_patch_merge_type": "flat",
  "mm_projector_lr": 2e-05,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "./data/clip-vit-large-patch14-336/",
  "model_type": "llava_llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 2048,
  "tokenizer_padding_side": "right",
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "tune_mm_mlp_adapter": false,
  "tune_mm_vision_resampler": false,
  "unfreeze_mm_vision_tower": false,
  "use_cache": true,
  "use_mm_proj": true,
  "vocab_size": 32000
}
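
One more thing I can check on my side, given the two configs above: the adapter targets all seven projection matrices in every layer, so if it were still applied as an unmerged PEFT wrapper at inference time, every generated token would pay for the extra low-rank matmuls, and any weights left in fp32 (the base config says torch_dtype float16) would slow decoding further. A small sketch to confirm the loaded model is actually merged and in fp16 (same paths as above):

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

lora_path = "./LLaVA-v1.5-7b-lora/checkpoint-50000/"
base_path = "./LLAVA-1.5/llava-v1.5-7b/"
_, model, _, _ = load_pretrained_model(lora_path, base_path, get_model_name_from_path(lora_path))

# If the adapter was merged at load time, no lora_A / lora_B submodules should remain.
leftover = [name for name, _ in model.named_modules() if "lora_" in name]
print(f"unmerged LoRA submodules: {len(leftover)}")

# All parameters should be fp16; stray fp32 tensors would make decoding slower than the base run.
print("parameter dtypes:", {p.dtype for p in model.parameters()})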