Pre-training has also become the de-facto approach in vision-language modeling (Lu et al., 2019; Chen et al., 2020c; Li et al., 2020).
The resulting dataset is noisy, but is two orders of magnitude larger than the Conceptual Captions dataset.
To train our model, we use an objective that aligns the visual and language representations in a shared latent embedding space using a simple dual-encoder architecture.
Image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together while pushing those of non-matched image-text pairs apart.
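The contrastive objective described above can be sketched as follows. This is a minimal NumPy illustration of a symmetric normalized-softmax (InfoNCE-style) loss, not the paper's actual implementation; the batch size and embedding dimension here are arbitrary.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric normalized-softmax loss over a batch of image-text pairs.

    Row i of image_emb and text_emb is a matched pair; every other row
    in the batch serves as a negative for it.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # (B, B) similarity matrix, sharpened by the temperature.
    logits = image_emb @ text_emb.T / temperature

    def softmax_xent(l):
        # Cross-entropy with the matched pair (diagonal) as the positive.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return (softmax_xent(logits) + softmax_xent(logits.T)) / 2
```

With well-aligned pairs the diagonal dominates and the loss approaches zero; with random text embeddings it stays near log of the batch size.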
Recently, more advanced models with cross-modal attention layers have emerged (Liu et al., 2019a; Lu et al., 2019; Chen et al., 2020c; Huang et al., 2020b) and show superior performance on image-text matching tasks.
ALIGN follows the natural distribution of image-text pairs from the raw alt-text data.
Instead of manually sweeping for the optimal temperature value, we find that it can be effectively learned together with all the other parameters.
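To make the temperature learnable as stated above, a common trick (assumed here, not taken from the paper) is to optimize the log of the temperature so it stays positive; its gradient then flows through the same loss as every other parameter. A minimal NumPy sketch, using a finite difference in place of autodiff:

```python
import numpy as np

def loss_with_temperature(log_tau, sims):
    """Normalized-softmax loss as a function of a learnable log-temperature.

    sims: (B, B) cosine-similarity matrix; the diagonal holds matched pairs.
    """
    tau = np.exp(log_tau)  # exp keeps the temperature strictly positive
    logits = sims / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Treat log_tau as just another parameter: its gradient (approximated
# here by a central finite difference) can be fed to the same optimizer
# that updates the encoder weights.
rng = np.random.default_rng(0)
sims = np.clip(rng.normal(0.2, 0.3, size=(4, 4)), -1.0, 1.0)
np.fill_diagonal(sims, 0.9)  # matched pairs are more similar
eps, log_tau = 1e-5, np.log(0.07)
grad = (loss_with_temperature(log_tau + eps, sims)
        - loss_with_temperature(log_tau - eps, sims)) / (2 * eps)
```

In a real framework the finite difference is unnecessary; the log-temperature is simply registered as a trainable scalar.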
This paper is another meaningful piece of work from Google. It proposes a very simple method to improve a model's representation ability on vision-language and vision tasks. The method lightly processes a raw corpus to obtain a large-scale noisy dataset, then uses contrastive learning to pre-train a very basic dual-encoder model (on an image-caption task). The resulting model has set new SOTA results on many tasks.
Information
1 New things learned:
2 Knowledge gained from the Related Work section
Provides references to very recent and very concrete cross-modal research work; for deeper study later, I can find what I need there.
3 Experimental validation tasks (briefly describe any that are unfamiliar)
Covered in point 2 of section 1.
4 Within my knowledge, what other tasks could be tried
Cross-modal tasks other than image captioning.
5 Good sentences