SimCSE proposes two ways of improving sentence embeddings using contrastive learning.
Using unsupervised learning: pass each sentence through the encoder twice with different dropout masks, treat the two slightly different embeddings as a positive pair, and use the other sentence embeddings in the mini-batch as negatives.
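The dropout-as-augmentation objective can be sketched in a few lines of NumPy. This is a minimal illustration of the loss, not the actual SimCSE implementation: the random matrix stands in for encoder outputs, and the function names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.1):
    """Standard dropout: zero random elements and rescale the rest."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1 - p)

def cos_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def unsup_simcse_loss(batch_embs, temperature=0.05):
    """Two dropout passes over the same batch; matching rows are
    positives, all other rows in the batch are negatives (InfoNCE)."""
    z1 = dropout(batch_embs)  # first "forward pass"
    z2 = dropout(batch_embs)  # second pass, different dropout mask
    sims = cos_sim(z1, z2) / temperature  # (N, N)
    # cross-entropy with the diagonal (the positive pair) as the target
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diagonal(log_probs)))

batch = rng.normal(size=(8, 32))  # stand-in for encoder outputs
loss = unsup_simcse_loss(batch)
print(loss)
```

In the real setup the two passes go through the full transformer, so the dropout masks differ at every layer rather than only on the final embedding as here.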
Using supervised learning: leverage NLI datasets to improve embeddings, using entailment pairs as positives and contradiction pairs as hard negatives.
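A sketch of the supervised variant of the objective, again with random stand-ins for the embeddings and hypothetical function names: each premise's entailment hypothesis is its positive, while the other entailments in the batch plus all contradiction hypotheses serve as negatives.

```python
import numpy as np

def cos_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def sup_simcse_loss(premise, entail, contra, temperature=0.05):
    """Supervised contrastive loss: diagonal of the premise-entailment
    similarity matrix holds the positives; off-diagonal entailments and
    all contradictions are negatives."""
    pos = cos_sim(premise, entail) / temperature  # (N, N), diag = positives
    neg = cos_sim(premise, contra) / temperature  # (N, N), hard negatives
    logits = np.concatenate([pos, neg], axis=1)   # (N, 2N)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diagonal(log_probs)))

rng = np.random.default_rng(0)
p = rng.normal(size=(4, 16))
e = p + 0.1 * rng.normal(size=(4, 16))  # entailments: close to premises
c = rng.normal(size=(4, 16))            # contradictions: unrelated
loss = sup_simcse_loss(p, e, c)
print(loss)
```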
SimCSE is evaluated on standard semantic textual similarity (STS) tasks. The unsupervised and supervised models using BERT-base achieve an average Spearman's correlation of 76.3% and 81.6% respectively, a 4.2% and 2.2% improvement over previous best results.
TSDAE adds noise to the input sentence (e.g. swapping or deleting words) and asks a denoising autoencoder to reconstruct the original input. The decoder is modified so that it decodes only from a fixed-size sentence representation produced by the encoder; it does not have access to the encoder's contextualized word embeddings.
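The noising step is simple to sketch. The function names below are my own; the paper reports token deletion with a ratio of 0.6 as the best-performing noise, with swapping as an alternative.

```python
import random

def delete_noise(tokens, ratio=0.6, rng=None):
    """TSDAE-style input corruption: drop each token with probability
    `ratio`; the decoder must then reconstruct the original sentence
    from the pooled embedding of this corrupted input."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > ratio]
    return kept if kept else [rng.choice(tokens)]  # never return empty

def swap_noise(tokens, n_swaps=2, rng=None):
    """Alternative corruption: swap random adjacent token pairs."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

sent = "the quick brown fox jumps over the lazy dog".split()
noised = delete_noise(sent)
swapped = swap_noise(sent)
print(noised)
print(swapped)
```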
TSDAE improves embeddings by up to 6.4 MAP points and can achieve up to 93.1% of the performance of in-domain supervised approaches.
They also criticize the STS metric for its lack of correlation with downstream in-domain tasks, which is why they report MAP instead.
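For reference, MAP (mean average precision) is computed per query from the ranked relevance of retrieved items and then averaged across queries. A minimal sketch with toy relevance judgments (the data here is made up for illustration):

```python
def average_precision(ranked_relevance):
    """AP for one query: `ranked_relevance` is a list of 0/1 relevance
    flags in ranked order; average the precision at each relevant hit."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of per-query average precision."""
    return sum(average_precision(r) for r in runs) / len(runs)

# two toy queries with relevance flags of the top-5 retrieved items
map_score = mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]])
print(map_score)  # (5/6 + 9/20) / 2 = 77/120
```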
They test two setups, supervised and unsupervised:
Unsupervised: they assume unlabeled sentences from the target task are available and tune their approaches on these sentences.
Supervised:
Try out
Use a more-or-less standardized dataset for text similarity (e.g. from the benchmarks used in the papers) and try to compare results. It is not necessary to use finetuner for now, as we are exploring techniques to integrate into finetuner later.
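The standard STS evaluation protocol the papers use is: embed both sentences of each pair, take the cosine similarity, and report the Spearman correlation against the gold scores. A self-contained sketch with synthetic stand-ins (no real STS data is loaded, and the helper names are my own):

```python
import numpy as np

def rank(x):
    """Ranks of the values in x (no tie handling; fine for floats)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

def cos_sim_rows(a, b):
    """Cosine similarity of corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

rng = np.random.default_rng(0)
# synthetic sentence pairs: the more noise added, the less similar
emb1 = rng.normal(size=(200, 32))
noise_scale = rng.uniform(0.1, 2.0, size=(200, 1))
emb2 = emb1 + noise_scale * rng.normal(size=(200, 32))
pred = cos_sim_rows(emb1, emb2)   # model similarity scores
gold = -noise_scale[:, 0]         # stand-in gold scores
r = spearman(pred, gold)
print(r)
```

With a real benchmark, `emb1`/`emb2` would come from the model under test and `gold` from the dataset's human similarity ratings.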