Pre-training has also become the de-facto approach in vision-language modeling (Lu et al., 2019; Chen et al., 2020c; Li et al., 2020).
The resulting dataset is noisy, but is two orders of magnitude larger than the Conceptual Captions dataset.
To train our model, we use an objective that aligns the visual and language representations in a shared latent embedding space using a simple dual-encoder architecture.
Image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together while pushing those of non-matched image-text pairs apart.
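The contrastive objective described above can be sketched as follows. This is a minimal NumPy illustration of a symmetric normalized-softmax (InfoNCE-style) loss, not the paper's actual implementation; the batch size and embedding dimension here are arbitrary.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric normalized-softmax loss over a batch of image-text pairs.

    Row i of image_emb and text_emb is a matched pair; every other row
    in the batch serves as a negative for it.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # (B, B) similarity matrix, sharpened by the temperature.
    logits = image_emb @ text_emb.T / temperature

    def softmax_xent(l):
        # Cross-entropy with the matched pair (diagonal) as the positive.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return (softmax_xent(logits) + softmax_xent(logits.T)) / 2
```

With well-aligned pairs the diagonal dominates and the loss approaches zero; with random text embeddings it stays near log of the batch size.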
Recently, more advanced models with cross-modal attention layers have emerged (Liu et al., 2019a; Lu et al., 2019; Chen et al., 2020c; Huang et al., 2020b) and show superior performance on image-text matching tasks.
ALIGN follows the natural distribution of image-text pairs from the raw alt-text data.
Instead of manually sweeping for the optimal temperature value, we find that it can be effectively learned together with all the other parameters.
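To make the temperature learnable as stated above, a common trick (assumed here, not taken from the paper) is to optimize the log of the temperature so it stays positive; its gradient then flows through the same loss as every other parameter. A minimal NumPy sketch, using a finite difference in place of autodiff:

```python
import numpy as np

def loss_with_temperature(log_tau, sims):
    """Normalized-softmax loss as a function of a learnable log-temperature.

    sims: (B, B) cosine-similarity matrix; the diagonal holds matched pairs.
    """
    tau = np.exp(log_tau)  # exp keeps the temperature strictly positive
    logits = sims / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Treat log_tau as just another parameter: its gradient (approximated
# here by a central finite difference) can be fed to the same optimizer
# that updates the encoder weights.
rng = np.random.default_rng(0)
sims = np.clip(rng.normal(0.2, 0.3, size=(4, 4)), -1.0, 1.0)
np.fill_diagonal(sims, 0.9)  # matched pairs are more similar
eps, log_tau = 1e-5, np.log(0.07)
grad = (loss_with_temperature(log_tau + eps, sims)
        - loss_with_temperature(log_tau - eps, sims)) / (2 * eps)
```

In a real framework the finite difference is unnecessary; the log-temperature is simply registered as a trainable scalar.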
This paper is another meaningful piece of work from Google. It proposes a very simple method to improve a model's representation ability on vision-language and vision tasks. The method lightly processes a raw corpus to obtain a large-scale noisy dataset, then uses contrastive learning to pre-train a very basic dual-encoder model (on an image-caption task). The resulting model has set new SOTA results on many tasks.
Information
1 New things learned:
2 Knowledge gained from the Related Work section
Provides references to very recent and very concrete cross-modal research work; for deeper study later, I can find what I need there.
3 Experimental validation tasks (briefly describe any that are unfamiliar)
Covered in point 2 of section 1.
4 Within my knowledge, what other tasks could be tried
Cross-modal tasks other than image captioning.
5 Good sentences