Feature request
I propose adding loss computation for Q-Former training in the BLIP-2 model. Implementing this feature would allow fine-tuning the Q-Former and language model for image-text retrieval and captioning tasks, which is crucial for practical applications.
Motivation
I want to train the BLIP-2 model using the transformers library. However, the loss functions for Image-Text Contrastive learning (ITC), Image-Text Matching (ITM), and Image-grounded Text Generation (ITG) are not included, so users currently have to implement them manually.
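For reference, here is a rough sketch of what the ITC loss could look like (my own illustration, not code from the BLIP-2 or transformers codebase). It assumes L2-normalized projections of the Q-Former query outputs and of the text [CLS] output; the tensor shapes and the temperature value are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def itc_loss(query_embeds: torch.Tensor, text_embeds: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Image-Text Contrastive (ITC) loss, BLIP-2 style (sketch).

    query_embeds: (batch, num_query_tokens, dim) -- projected and
        L2-normalized Q-Former query outputs for each image.
    text_embeds:  (batch, dim) -- projected and L2-normalized text
        [CLS] outputs.
    """
    # Image-text similarity: for each (image i, text j) pair, BLIP-2
    # takes the maximum similarity over the image's query tokens.
    sim = torch.einsum("iqd,jd->ijq", query_embeds, text_embeds)
    sim = sim.max(dim=-1).values / temperature  # (batch, batch)

    # Matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_i2t = F.cross_entropy(sim, targets)      # image -> text
    loss_t2i = F.cross_entropy(sim.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings (batch=4, 32 query tokens, dim=256).
queries = F.normalize(torch.randn(4, 32, 256), dim=-1)
texts = F.normalize(torch.randn(4, 256), dim=-1)
print(itc_loss(queries, texts))
```

As I understand the paper, ITM would similarly add a binary classification head over matched/hard-negative pairs, and ITG is essentially a causal language-modeling loss over the text tokens conditioned on the query outputs.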
Your contribution
I would like to contribute to this open-source project by implementing the loss functions.