huggingface/transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ALBEF: Align Before Fuse #17224

Open · ggoggam opened this issue 2 years ago

ggoggam commented 2 years ago

Model description

Align Before Fuse (ALBEF) is a vision-language (VL) model that has shown competitive results on numerous VL tasks, such as image-text retrieval, visual question answering, visual entailment, and visual grounding.

The authors propose using a text encoder (the first half of BERT's layers) and an image encoder (ViT) to produce an aligned representation for each modality before fusing the two with a multimodal encoder (the second half of BERT's layers); a sketch of this flow follows below. The model is pre-trained with image-text contrastive, image-text matching, and masked language modeling objectives, combined with momentum distillation, and achieves state-of-the-art results on VL tasks.

As multi-modal models are gaining more attention in academia/industry, I think ALBEF could be a nice addition to the transformers library.

Open source status

Provide useful links for the implementation:

- Paper: "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation" (https://arxiv.org/abs/2107.07651)
- Official code and pre-trained checkpoints: https://github.com/salesforce/ALBEF

DanielFLevine-zz commented 2 years ago

@jkgrad What is the state of this issue? If no one is working on this, I would like to implement it.

LysandreJik commented 2 years ago

Hey @DanielFLevine, we'd love for you to try and contribute that model!

cc @NielsRogge who can help out once he's back from leave :)

DanielFLevine-zz commented 2 years ago

@LysandreJik @NielsRogge Great! I've already started looking over the authors' code. Will reach out with any questions.

ethansmith2000 commented 1 year ago

Is there still interest in this?

Yassin-fan commented 1 year ago

Same question

amyeroberts commented 1 year ago

@DanielFLevine-zz - any updates on the model port?

chengjiali commented 5 months ago

Any progress on this?