There is a button for that on the main page of the repo. Would that work, @Shubodh?
Thanks for your reply!
A BibTeX citation pointing to a more detailed work, such as a research paper or even a detailed blog post, would be more helpful. I have gone through your blog and the GitHub README, but they give very little detail, for example:
> The multimodal part takes unimodal features from the unimodal part as input and enhances them with a cross-attention mechanism.
I need more detail than that for my paper.
Also, can I only run inference with your code, or is there a way to re-train your pretrained mid-fusion models? @ashvardanian
@Shubodh Hello! Thank you for your interest.
UForm is a two-tower model with an image encoder and a text encoder (a ViT for the image encoder, BERT for the text encoder; the model configs can be found in the HF repositories). The text encoder consists of two parts: the unimodal part and the multimodal part. The unimodal part encodes a text using standard transformer layers with self-attention. The multimodal part additionally incorporates information from the image via a cross-attention mechanism.

The model was trained by minimizing three objectives:

- the usual contrastive loss (unimodal part + image encoder),
- MLM (multimodal part + image encoder),
- an image-text matching loss (multimodal part + image encoder).

More details about the objectives can be found in the paper "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation". Training data: a part of CC + the MSCOCO train split + VG + SBU. The multilingual model was trained on a balanced dataset with NLLB translations, with special losses used for cross-lingual properties.
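For anyone trying to picture the mid-fusion layout described above, here is a minimal PyTorch sketch. It is not UForm's actual code: the module names, dimensions, and the single `MultimodalBlock` are illustrative assumptions, and only the contrastive objective is written out (as standard symmetric InfoNCE, following the ALBEF paper referenced above); the MLM and image-text matching heads are only noted in comments.

```python
# Illustrative sketch only: layer sizes, names, and the single block below are
# assumptions, not UForm's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalBlock(nn.Module):
    """One block of the multimodal part: self-attention over text tokens plus
    cross-attention from text (queries) to image tokens (keys/values)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        t = self.norm1(text_tokens)
        x = text_tokens + self.self_attn(t, t, t)[0]                # unimodal-style self-attention
        q = self.norm2(x)
        x = x + self.cross_attn(q, image_tokens, image_tokens)[0]   # fuse image information
        return x + self.mlp(self.norm3(x))


def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Objective 1 (unimodal part + image encoder): symmetric InfoNCE over pooled
    embeddings. MLM and image-text matching would be extra heads on the multimodal part."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    batch, text_len, img_len, dim = 2, 16, 49, 256
    text_tokens = torch.randn(batch, text_len, dim)    # output of the unimodal text part
    image_tokens = torch.randn(batch, img_len, dim)    # output of the ViT image encoder
    fused = MultimodalBlock(dim)(text_tokens, image_tokens)
    loss = contrastive_loss(text_tokens.mean(dim=1), image_tokens.mean(dim=1))
    print(fused.shape, loss.item())
```

A real multimodal part would stack several such blocks on top of the pre-computed unimodal features and add the MLM and image-text matching heads mentioned above.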
I can't find any research paper corresponding to this work. How can I cite your work in my research paper? I need it in BibTeX form, for example like below: