salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Inability to reproduce BLIP2 VQAv2 finetune results #460

Open CupidJay opened 1 year ago

CupidJay commented 1 year ago

I tried to reproduce the fine-tuning results of BLIP-2 FlanT5-XL on VQAv2, but my results are far from those in the paper: my best accuracy is 76.58%, while the paper reports 81.55%. I want to figure out what's wrong with my code.

I modified the forward code according to this, and I also added the Instruct-style implementation.
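Roughly, the Q-Former part of my modified forward looks like the sketch below. This is a simplified paraphrase patterned on the Instruct-style BLIP-2 models in LAVIS, not my exact code; the max_length of 32 and the variable names are my assumptions.

# Sketch (method-body excerpt, so `self` and `samples` come from the model's
# forward): feed the question tokens into the Q-Former alongside the queries.
import torch

image_embeds = self.ln_vision(self.visual_encoder(image))
image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)

text_Qformer = self.tokenizer(
    samples["text_input"],
    padding="longest",
    truncation=True,
    max_length=32,  # assumed cap on question length
    return_tensors="pt",
).to(image.device)
query_atts = torch.ones(query_tokens.size()[:-1], dtype=torch.long).to(image.device)
Qformer_atts = torch.cat([query_atts, text_Qformer.attention_mask], dim=1)

query_output = self.Qformer.bert(
    text_Qformer.input_ids,
    attention_mask=Qformer_atts,
    query_embeds=query_tokens,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=True,
)
# Only the query positions are projected into the T5 input space.
inputs_t5 = self.t5_proj(query_output.last_hidden_state[:, : query_tokens.size(1), :])

My YAML configuration is as follows: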

model:
  arch: blip2_t5
  model_type: pretrain_flant5xl
  load_pretrained: True
  pretrained: '/share/datasets/blip2_pretrained_flant5xl.pth'
  vit_model: eva_clip_g

  # vit encoder
  image_size: 400
  drop_path_rate: 0
  use_grad_checkpoint: False
  vit_precision: "fp32"
  freeze_vit: False

  # Q-Former
  num_query_token: 32

datasets:
  coco_vqa:
    vis_processor:
        train:
          name: "blip_image_train"
          image_size: 400
        eval:
          name: "blip_image_eval"
          image_size: 400
        test:
          name: "blip_image_eval"
          image_size: 400
    text_processor:
        train:
          name: "blip_question"
        eval:
          name: "blip_question"
        test:
          name: "blip_question"
  vg_vqa: # name of the dataset builder
    vis_processor:
        train:
          name: "blip_image_train"
          image_size: 400
    text_processor:
        train:
          name: "blip_question"
run:
  task: vqa
  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-5 
  min_lr: 0 
  warmup_steps: 1000
  warmup_lr: 1e-8
  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 8 
  batch_size_eval: 32
  num_workers: 4
  accum_grad_iters: 1
  lr_layer_decay: 0.95 # layer-wise learning rate decay for the ViT 

  max_len: 10
  min_len: 1
  num_beams: 5
  inference_method: "generate"
  prompt: "Question: {} Short answer:"

  seed: 42
  output_dir: "output/BLIP2_A100/flanT5_VQA"

  amp: True
  resume_ckpt_path: null

  evaluate: False
  train_splits: ["train"]
  valid_splits: ["val"]
  test_splits: ["val"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True
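For reference, my understanding is that the prompt and generation settings above are applied at evaluation time roughly like this (a paraphrase of blip2_t5.predict_answers, not the exact LAVIS code; the example questions are made up):

# Each VQA question is wrapped in the configured prompt before generation.
prompt = "Question: {} Short answer:"
questions = ["What color is the bus?", "How many dogs are there?"]  # made-up examples
text_input = [prompt.format(q) for q in questions]
# Generation then runs with num_beams=5, max_len=10, min_len=1 as configured.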

I really appreciate your great work. Could you help me see where the problem is?

simon-ging commented 1 year ago

I also fine-tuned the model and carefully implemented all the details from the paper, but got only 76.80. I had to reduce the image size due to computational costs, but I would still expect a better result even at 224 px.

I would kindly ask if you could upload the fine-tuned T5 + ViT-g model somewhere.

Thank you for all your valuable contributions to the field.