huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Implement QFormer for pretrain #22645

Closed — dinhanhx closed this issue 12 months ago

dinhanhx commented 1 year ago

Feature request

In BLIP-2, there is a pretraining stage (stage 1) for the Q-Former. [Figure: BLIP-2 stage-1 pretraining diagram]

An implementation of the Q-Former for this pretraining stage is requested.

Motivation

In Hugging Face's BLIP-2 source code, I see no implementation of the stage-1 pretraining components: text inputs to the Q-Former, the image-text contrastive loss, the image-grounded text generation loss, or the image-text matching loss. Currently, the source code only covers vision-language generative learning (stage 2). An implementation would therefore be very helpful for people who are interested in stage 1 of the Q-Former (like me).
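For reference, here is a minimal sketch of what the image-text contrastive (ITC) part of stage 1 could look like, assuming the Q-Former query outputs and the text [CLS] feature have already been projected and L2-normalized. The function name `itc_loss`, the tensor shapes, and the temperature value are illustrative assumptions and not part of the transformers codebase:

```python
# Illustrative sketch of the BLIP-2 stage-1 image-text contrastive loss,
# NOT an existing transformers API. Assumes paired image/text batches and
# already-projected, L2-normalized features.
import torch
import torch.nn.functional as F


def itc_loss(query_feats: torch.Tensor, text_feats: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """
    query_feats: (batch, num_queries, dim) - projected, normalized Q-Former query outputs
    text_feats:  (batch, dim)              - projected, normalized text [CLS] features
    """
    # Similarity of every text against every image's queries: (batch_text, batch_image, num_queries)
    sim = torch.einsum("td,bqd->tbq", text_feats, query_feats)

    # BLIP-2 keeps the best-matching query for each image-text pair
    sim_t2i, _ = sim.max(dim=-1)   # (batch_text, batch_image)
    sim_i2t = sim_t2i.t()          # (batch_image, batch_text)

    # Matched pairs lie on the diagonal of the similarity matrix
    targets = torch.arange(text_feats.size(0), device=text_feats.device)
    loss_t2i = F.cross_entropy(sim_t2i / temperature, targets)
    loss_i2t = F.cross_entropy(sim_i2t / temperature, targets)
    return (loss_t2i + loss_i2t) / 2
```

The image-grounded text generation and image-text matching objectives additionally rely on the different self-attention masking schemes described in the BLIP-2 paper, so supporting stage 1 would need more than standalone loss functions.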

Your contribution

Unfortunately, I don't think there is a way that I could help.

dinhanhx commented 1 year ago

@NielsRogge Gentle ping because I saw your name in the docs

sgugger commented 1 year ago

cc @younesbelkada

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

jianantian commented 1 year ago

As the issue has been reopened, is there any plan to implement the losses for the Q-Former?

younesbelkada commented 1 year ago

Hi @jianantian, I haven't had time to take a look unfortunately. If you want to try your hand at it, feel free to open a PR and we'll guide you from there!

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.