ggoggam opened this issue 2 years ago (status: Open)
@jkgrad What is the state of this issue? If no one is working on this, I would like to implement it.
Hey @DanielFLevine, we'd love for you to try and contribute that model!
cc @NielsRogge who can help out once he's back from leave :)
@LysandreJik @NielsRogge Great! I've already started looking over the authors' code. Will reach out with any questions.
Is there still interest in this?
Same question
@DanielFLevine-zz - any updates on the model port?
any progress on this?
Model description
Align Before Fuse (ALBEF) is a vision-language (VL) model that has shown competitive results on numerous VL tasks such as image-text retrieval, visual question answering, visual entailment, and visual grounding.
The authors propose using a text encoder (the first half of BERT's layers) and an image encoder (ViT) to produce aligned representations for each modality via an image-text contrastive loss, before fusing them with a multimodal encoder (the second half of BERT's layers) through cross-attention. The model is pre-trained with image-text contrastive, image-text matching, and masked language modeling objectives, together with momentum distillation, and achieves state-of-the-art results on VL tasks.
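To make the "align before fuse" idea concrete, here is a minimal PyTorch sketch of the forward pass. It is an illustration only, not the authors' implementation or a proposed transformers API: the class names (`AlbefSketch`, `FusionLayer`), layer counts, and dimensions are assumptions, and the unimodal encoders are stand-in transformer stacks rather than real ViT/BERT weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionLayer(nn.Module):
    """One multimodal layer: self-attention over text, cross-attention to image patches."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text, image):
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), image, image)[0]
        return text + self.ffn(self.norm3(text))


class AlbefSketch(nn.Module):
    """Illustrative stand-in for ALBEF: unimodal encoding, contrastive alignment, then fusion."""

    def __init__(self, dim=768, embed_dim=256, n_fusion_layers=6):
        super().__init__()
        # Stand-ins for the real unimodal encoders (ViT and the first 6 BERT layers).
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 12, 4 * dim, batch_first=True), num_layers=6
        )
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 12, 4 * dim, batch_first=True), num_layers=6
        )
        # Projections feeding the image-text contrastive ("align") loss.
        self.image_proj = nn.Linear(dim, embed_dim)
        self.text_proj = nn.Linear(dim, embed_dim)
        # Stand-in for the last 6 BERT layers acting as the multimodal ("fuse") encoder.
        self.fusion = nn.ModuleList(FusionLayer(dim) for _ in range(n_fusion_layers))

    def forward(self, image_patches, text_embeds):
        # Inputs are assumed to be already patch-embedded / token-embedded for brevity.
        img = self.image_encoder(image_patches)   # (B, num_patches, dim)
        txt = self.text_encoder(text_embeds)      # (B, seq_len, dim)
        # "Align": projected [CLS] features, to be compared with a contrastive loss.
        img_feat = F.normalize(self.image_proj(img[:, 0]), dim=-1)
        txt_feat = F.normalize(self.text_proj(txt[:, 0]), dim=-1)
        # "Fuse": text states attend to image patches through cross-attention.
        fused = txt
        for layer in self.fusion:
            fused = layer(fused, img)
        return img_feat, txt_feat, fused


model = AlbefSketch()
img_feat, txt_feat, fused = model(torch.randn(2, 197, 768), torch.randn(2, 30, 768))
```

In the paper, an InfoNCE-style contrastive loss on the projected [CLS] features handles the alignment step, while image-text matching and masked language modeling heads operate on the fused output, with momentum distillation providing soft pseudo-targets.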
As multi-modal models are gaining more attention in academia/industry, I think ALBEF could be a nice addition to the transformers library.
Open source status

- The model implementation is available
- The model weights are available
Provide useful links for the implementation

- Paper: "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation" (https://arxiv.org/abs/2107.07651)
- Official code and pre-trained weights: https://github.com/salesforce/ALBEF