unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

How to cite your work #43

Closed: Shubodh closed this issue 11 months ago

Shubodh commented 11 months ago

I can't find any research paper corresponding to this work. How can I cite it in my research paper? I need it in the form of BibTeX, for example like the entry below:

@misc{shukor2022efficient,
      title={Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment}, 
      author={Mustafa Shukor and Guillaume Couairon and Matthieu Cord},
      year={2022},
      eprint={2208.13628},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
ashvardanian commented 11 months ago

There is a button for that on the main page of the repo. Would that work, @Shubodh?

[Screenshot IMG_1063: the citation button on the repo page]

Shubodh commented 11 months ago

Thanks for your reply!

1. About the citation

A BibTeX citation to a more detailed work, like a research paper or even a detailed blog post, would be more helpful. I have gone through your blog and the GitHub README, but they give very few details, for example:

The multimodal part takes unimodal features from the unimodal part as input and enhances them with a cross-attention mechanism.

I need more detail than that for my paper.

2. A question about retraining

Also, can I only run inference with your code, or is there a way to re-train your pretrained mid-fusion models? @ashvardanian

kimihailv commented 11 months ago

@Shubodh Hello! Thank you for your interest.

1. Unfortunately, we don't have a paper, but we can share more details:

UForm is a two-tower model with an image encoder and a text encoder (a ViT for the image encoder and a BERT for the text encoder; the model configs can be found in the HF repositories). The text encoder consists of two parts: the unimodal part and the multimodal part. The unimodal part encodes the text using standard transformer layers with self-attention. The multimodal part additionally incorporates information from the image via a cross-attention mechanism. (A rough PyTorch sketch of this design is included after this list.)

The model was trained by minimizing three objectives: the usual contrastive loss (unimodal part + image encoder), MLM (multimodal part + image encoder), and an image-text matching loss (multimodal part + image encoder). More details about the objectives can be found in the paper "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation".

Training data: a subset of CC, the MSCOCO train split, VG, and SBU. The multilingual model was trained on a balanced dataset with NLLB translations, with special losses added for cross-lingual properties.

2. We are not going to publish the training code. However, our models are regular torch models and their weights are publicly available on HF, so you can re-train or fine-tune them like any other torch model (see the second sketch below).
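
To make point 1 more concrete, here is a minimal PyTorch sketch of a mid-fusion text tower and the contrastive objective on its unimodal outputs. All class names, layer counts, and hyperparameters are illustrative assumptions made for this issue, not UForm's actual implementation; the real configs live in the HF repositories, and the MLM and ITM objectives would attach their own heads to the multimodal output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MidFusionTextEncoder(nn.Module):
    """BERT-like text tower: unimodal self-attention layers followed by
    multimodal layers that cross-attend to image patch features."""

    def __init__(self, dim=768, heads=12, unimodal_layers=8, multimodal_layers=4):
        super().__init__()
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(unimodal_layers)]
        )
        # Each multimodal layer cross-attends to the image features,
        # then refines the fused representation with self-attention.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(multimodal_layers)]
        )
        self.multimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(multimodal_layers)]
        )

    def forward(self, text_emb, image_feats=None):
        x = text_emb
        for layer in self.unimodal:
            x = layer(x)                                 # unimodal features -> contrastive loss
        unimodal_out = x
        if image_feats is None:
            return unimodal_out, None
        for attn, layer in zip(self.cross_attn, self.multimodal):
            fused, _ = attn(x, image_feats, image_feats)  # cross-attention to the image
            x = layer(x + fused)                          # multimodal features -> MLM / ITM heads
        return unimodal_out, x


def contrastive_loss(text_cls, image_cls, temperature=0.07):
    """Symmetric InfoNCE between matching text/image pairs in a batch,
    computed on unimodal text features and image-encoder features."""
    text_cls = F.normalize(text_cls, dim=-1)
    image_cls = F.normalize(image_cls, dim=-1)
    logits = text_cls @ image_cls.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


# Toy smoke test with random token and patch embeddings.
text_tower = MidFusionTextEncoder()
tokens = torch.randn(2, 16, 768)           # token embeddings for 2 captions
patches = torch.randn(2, 197, 768)         # ViT patch features for 2 images
uni, multi = text_tower(tokens, patches)   # both (2, 16, 768)
loss = contrastive_loss(uni[:, 0], patches[:, 0])
```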
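
And a hedged sketch of point 2: loading the published weights and fine-tuning them like any other torch module with a CLIP-style contrastive loss. The `uform.get_model(...)` call and the `preprocess_*` / `encode_*` method names are assumptions about the package's public API at the time of writing and may differ between versions; the one-pair toy "dataset" is a hypothetical stand-in for your own DataLoader.

```python
import torch
import torch.nn.functional as F
from PIL import Image

import uform  # pip install uform

# Downloads the public weights from HF; model name and method names are
# assumptions about the package API and may change between versions.
model = uform.get_model('unum-cloud/uform-vl-english')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical stand-in for your own (images, captions) DataLoader.
pairs = [([Image.new('RGB', (224, 224))], ['a photo of a cat'])]

for images, texts in pairs:
    image_batch = torch.cat([model.preprocess_image(i) for i in images])
    text_batch = model.preprocess_text(texts)

    image_emb = model.encode_image(image_batch)
    text_emb = model.encode_text(text_batch)

    # CLIP-style symmetric contrastive loss over the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / 0.07
    labels = torch.arange(logits.size(0))
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```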