SimCSE: Simple Contrastive Learning of Sentence Embeddings

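This paper applies contrastive learning to sentence representation. It uses dropout as the data augmentation for building positive and negative examples: the same sentence is fed through the model twice, and because dropout is random, the two resulting embeddings differ even though the meaning of the sentence is unchanged. The two embeddings form a positive pair, and the other sentences in the same batch serve as negatives.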
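The two-pass idea can be sketched in a few lines of NumPy. This is only a toy illustration, not the paper's implementation: `encode` is a hypothetical stand-in for the real encoder (dropout on the input followed by a linear map), the feature vectors are random placeholders for sentences, and the temperature of 0.05 is an assumed value.

```python
import numpy as np

def encode(x, weight, rng, drop_p=0.1):
    """Toy stand-in for an encoder: dropout on the input, then a linear map."""
    mask = rng.random(x.shape) >= drop_p        # fresh random mask on every call
    x_dropped = x * mask / (1.0 - drop_p)       # inverted-dropout rescaling
    return np.tanh(x_dropped @ weight)

def simcse_loss(z1, z2, temperature=0.05):
    """InfoNCE with in-batch negatives: z1[i] should match z2[i]."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / temperature             # (batch, batch) cosine / tau
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # positives sit on the diagonal

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 32))                # 8 placeholder "sentences"
weight = rng.normal(size=(32, 16)) / np.sqrt(32)

z1 = encode(batch, weight, rng)                 # first pass
z2 = encode(batch, weight, rng)                 # second pass, different mask
loss = simcse_loss(z1, z2)
```

Because the mask is resampled on each call, `z1` and `z2` differ slightly for the same input, which is all the augmentation this setup needs.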

Notes

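Two metrics evaluate contrastively learned representations: alignment and uniformity. The former measures how similar the embeddings of positive pairs are; the latter measures how different (spread out) the negatives are. Both are well suited to experimental analysis.
Dropout as data augmentation is simple and effective; see the summary above for details.
BERT's representations suffer from a collapse problem, known in the literature as representation degeneration or anisotropy. Most embeddings cluster in one region of the vector space instead of spreading out across the wider semantic space, which severely limits how much they can express.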
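For reference, here is a rough sketch of how the two metrics can be computed, following the definitions of Wang and Isola (2020) that SimCSE adopts: alignment is the mean squared distance between normalized positive pairs, and uniformity is the log of the mean Gaussian potential over all pairs. Lower is better for both. The `spread` and `collapsed` arrays are made-up data, used only to show that clustered (anisotropic) embeddings score worse on uniformity.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(z1, z2, alpha=2):
    """Mean distance**alpha between positive pairs (lower = better aligned)."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    return np.mean(np.linalg.norm(z1 - z2, axis=1) ** alpha)

def uniformity(z, t=2):
    """Log mean Gaussian potential over distinct pairs (lower = more uniform)."""
    z = l2_normalize(z)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)        # each unordered pair once
    return np.log(np.mean(np.exp(-t * sq_dists[upper])))

rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 16))              # roughly isotropic directions
collapsed = 0.05 * spread + np.ones(16)         # clustered around one direction

u_spread = uniformity(spread)
u_collapsed = uniformity(collapsed)             # close to 0, i.e. poorly spread
```

Tracking both numbers over training checkpoints gives the kind of alignment versus uniformity plot the paper uses in its analysis.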

Several prior works have already tackled BERT's representation collapse. 1 2 3

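Current contrastive learning methods build positive and negative examples at the sample level: a sample and its perturbed copy form a positive pair. For cross-lingual or cross-domain tasks, one could instead build positives at the level of the domain or the language, and then check whether the model picks up domain- or language-specific features. (A half-formed idea that I have not thought through carefully.)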

We take the checkpoints of these models every 10 steps during training and visualize the alignment and uniformity metrics in Figure 2, along with a simple data augmentation model “delete one word”. As is clearly shown, all models largely improve the uniformity. However, the alignment of the two special variants also degrades drastically, while our unsupervised SimCSE keeps a steady alignment, thanks to the use of dropout noise. On the other hand, although “delete one word” slightly improves the alignment, it has a smaller gain on the uniformity, and eventually underperforms unsupervised SimCSE.