salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

BLIP generated caption length #45

Open giacomocamposampiero opened 1 year ago

giacomocamposampiero commented 1 year ago

Hi, thanks for the amazing work you did with the library!

I am currently trying to fine-tune BLIP on a custom dataset. I followed your tutorial on custom dataset generation and set up all the files needed for fine-tuning, and everything works as expected. The only problem I've encountered is the maximum length of the generated captions. In my training configuration file this length is set to 256, but the model never generates captions longer than ~50 words (roughly 90 tokens on average).

I have already increased the BERT embedding size to 256 by hard-coding it in this line:
https://github.com/salesforce/LAVIS/blob/6c6c981b8ea5a64ee9e706cf003559f7d8be085e/lavis/models/blip_models/blip_caption.py#L51
and changed the default max_length to 256 both here:
https://github.com/salesforce/LAVIS/blob/6c6c981b8ea5a64ee9e706cf003559f7d8be085e/lavis/models/blip_models/blip_caption.py#L214
and here:
https://github.com/salesforce/LAVIS/blob/6c6c981b8ea5a64ee9e706cf003559f7d8be085e/lavis/models/blip_models/blip_caption.py#L141
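
Incidentally, the generation defaults changed above can also be overridden per call, since generate() accepts them as keyword arguments. A minimal sketch using the public loading API (the image path is a placeholder):

from PIL import Image
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# num_beams / max_length / min_length override the hard-coded generation defaults.
captions = model.generate({"image": image}, num_beams=3, max_length=256, min_length=5)
print(captions[0])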

My training config file looks like this:

model:
  arch: blip_caption

  model_type: base_coco
  load_finetuned: False

datasets:
  custom_caption: # name of the dataset builder
    vis_processor:
        train:
          name: "blip_image_train"
        eval:
          name: "blip_image_eval"
    text_processor:
        train:
          name: "blip_caption"
          prompt: "a picture of "
        eval:
          name: "blip_caption"

run:
  task: captioning
  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-5
  min_lr: 0
  weight_decay: 0.05
  max_epoch: 20
  batch_size_train: 2
  batch_size_eval: 8
  num_workers: 1

  max_len: 256
  min_len: 5
  num_beams: 3

  seed: 42
  output_dir: "output/BLIP/Caption_custom"

  amp: False
  resume_ckpt_path: null

  evaluate: False 
  train_splits: ["train"]
  valid_splits: ["val"]
  test_splits: ["test"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True

I am training the model with 5000 samples. Do you have any suggestions on what could be wrong or missing in my fine-tuning configuration? Should I use different parameters for the optimizer? Is generating captions of this length even achievable with BLIP?
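
For completeness, there are two config-level knobs I plan to check next; both are assumptions on my part that still need verifying against the code. As far as I can tell, BlipCaption.from_config reads a max_txt_len field from the model section (default 40), and the blip_caption text processor truncates captions to max_words words (default 50), which would match the ~50-word ceiling I'm seeing:

model:
  arch: blip_caption
  model_type: base_coco
  load_finetuned: False
  max_txt_len: 256        # assumption: picked up by BlipCaption.from_config

datasets:
  custom_caption:
    text_processor:
        train:
          name: "blip_caption"
          prompt: "a picture of "
          max_words: 256    # assumption: lifts the processor's default 50-word truncation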

Thanks!

dxli94 commented 1 year ago

Hi @giacomocamposampiero, thanks for your interest. Glad to hear you are making progress.

Your configuration looks good to me. I see two possible aspects that you might want to consider (there is also a quick check sketched below the list):

1. The amount of training data: 5000 samples may be too few for the model to learn to produce captions much longer than the ones it was pre-trained on. You could try more data and/or more epochs.
2. BLIP's text decoder is pre-trained on short captions (e.g. COCO), so it may be biased towards short generations. For paragraph-length outputs, a language model better suited to long text generation may be worth exploring.
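
As a quick check (a hedged sketch; it only assumes the bert-base-uncased vocabulary that BLIP's text decoder uses), you can compare the token lengths of your ground-truth captions against the generated ones, to see whether you are hitting a token ceiling rather than a word ceiling:

from transformers import BertTokenizer

# BLIP's text decoder uses the bert-base-uncased vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Replace with your own ground-truth or generated captions (these are made up).
captions = [
    "a picture of three red triangles arranged above a large blue square",
    "a picture of two overlapping green circles next to a small yellow star",
]
lengths = [len(tokenizer(c).input_ids) for c in captions]
print("mean:", sum(lengths) / len(lengths), "max:", max(lengths))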

By the way, just out of curiosity: what data are you using? Is it a collection of images, each described by a paragraph?

These are just my guesses. Please feel welcome to discuss.

Thanks.

giacomocamposampiero commented 1 year ago

Thanks @dxli94 for the quick answer and your helpful suggestions! I will try to increase the training data size / number of epochs and, if that doesn't work, explore different language models more suitable for longer text generation.

About the data: yes, I'm using a collection of images, each described by a paragraph. The images, however, are quite simple (compositions of abstract geometric shapes) and the captions very structured and repetitive, hence I was hoping that my current data would be enough to fine-tune the model.

shams2023 commented 9 months ago

[Quotes the original post above.]

Hi! May I ask whether you are also working on image captioning? I also want to use BLIP-2 to generate image captions for my dataset. Have you implemented it? What is the quality of the captions it generates, and did you need to make any adjustments?

giacomocamposampiero commented 9 months ago

Hello, in the end it did not work for me, because I was not able to generate captions longer than ~50 words.

shams2023 commented 9 months ago

If you want to generate longer sentences, you can try the LLaVA model. I have tried the text generated by its demo.
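
For anyone landing here later, a minimal sketch of that route via Hugging Face Transformers (the checkpoint id and prompt format are assumptions based on the public llava-hf releases, not part of LAVIS):

from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumption: public LLaVA 1.5 checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)  # allows paragraph-length outputs
print(processor.decode(output[0], skip_special_tokens=True))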