jina-ai / finetuner

:dart: Task-oriented embedding tuning for BERT, CLIP, etc.
https://finetuner.jina.ai
Apache License 2.0

Explore text self-supervised learning techniques #298

Closed tadejsv closed 2 years ago

tadejsv commented 2 years ago

Try out

Use a more-or-less standardized dataset for text similarity (e.g. from the benchmarks used in the papers) and compare results. It is not necessary to use finetuner for now, as we are exploring techniques to integrate into finetuner later.

azayz commented 2 years ago

SimCSE stands for Simple Contrastive Learning of Sentence Embeddings. It proposes two ways of improving sentence embeddings using contrastive learning:

  1. Unsupervised learning: encode the same mini-batch twice so that dropout produces two slightly different embeddings per sentence, treat the two views of a sentence as a positive pair, and use the embeddings of the other sentences in the mini-batch as negatives (a minimal sketch of this objective follows the list).

  2. Supervised learning: leverage NLI datasets to improve embeddings, using entailment pairs as positives and contradiction pairs as hard negatives.
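A minimal PyTorch sketch of the unsupervised objective, assuming `bert-base-uncased`, CLS pooling and a temperature of 0.05 (all illustrative choices, not fixed by this issue):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active -- the two forward passes must differ

sentences = ["a man is playing guitar", "the weather is nice today", "finetuner tunes embeddings"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

# Two forward passes over the same batch -> two dropout-perturbed views per sentence
z1 = encoder(**batch).last_hidden_state[:, 0]  # CLS embeddings, view 1
z2 = encoder(**batch).last_hidden_state[:, 0]  # CLS embeddings, view 2

# InfoNCE: z1[i] should match z2[i]; every other z2[j] in the mini-batch is a negative
temperature = 0.05
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
labels = torch.arange(sim.size(0))
loss = F.cross_entropy(sim, labels)
loss.backward()
```

For the supervised variant, `z2` would instead come from the entailment hypotheses, with the contradiction hypotheses appended as additional hard negatives.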

SimCSE is evaluated on standard semantic textual similarity (STS) tasks; the unsupervised and supervised models using BERT-base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement over the previous best results.
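For reference, the STS evaluation protocol amounts to computing Spearman's correlation between the cosine similarity of the two sentence embeddings and the gold similarity score. A rough sketch on a hypothetical mini STS set (the pairs and scores below are made up):

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

# Hypothetical mini STS-style set: sentence pairs with gold similarity scores in [0, 5]
pairs = [
    ("a man is playing guitar", "a person plays a guitar", 4.8),
    ("a dog runs in the park", "stocks fell sharply today", 0.2),
    ("two kids are cooking", "children prepare a meal", 4.1),
]

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # CLS pooling

pred = torch.nn.functional.cosine_similarity(
    embed([a for a, _, _ in pairs]), embed([b for _, b, _ in pairs])
).tolist()
gold = [score for _, _, score in pairs]
corr, _ = spearmanr(pred, gold)
print(f"Spearman correlation: {corr:.3f}")
```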

TSDAE: Transformer-based Sequential Denoising AutoEncoder

TSDAE adds noise to the input sentence (swapping or deleting words) and asks the autoencoder to reconstruct the original input. The decoder is modified so that it decodes only from a fixed-size sentence representation produced by the encoder; it does not have access to all the contextualized word embeddings from the encoder.
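A rough PyTorch sketch of that bottleneck idea, assuming deletion noise, mean pooling to a single vector, and a small one-layer decoder; the noise ratio, pooling choice and decoder size here are illustrative assumptions, not the paper's exact configuration:

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
hidden = encoder.config.hidden_size

def delete_words(sentence, p=0.6):
    """Noise: randomly delete words, keeping at least one."""
    words = sentence.split()
    kept = [w for w in words if random.random() > p] or [random.choice(words)]
    return " ".join(kept)

original = "the quick brown fox jumps over the lazy dog"
noisy = delete_words(original)

src = tokenizer(noisy, return_tensors="pt")
tgt = tokenizer(original, return_tensors="pt")

# Encoder: compress the noisy sentence into one fixed-size vector (mean pooling here)
sent_vec = encoder(**src).last_hidden_state.mean(dim=1, keepdim=True)  # (1, 1, hidden)

# Decoder: cross-attends ONLY to that single vector, not to all token-level states
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)
to_vocab = nn.Linear(hidden, tokenizer.vocab_size)

tgt_emb = encoder.embeddings(tgt["input_ids"])  # reuse BERT's input embeddings
seq_len = tgt_emb.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
dec_out = decoder(tgt_emb, memory=sent_vec, tgt_mask=causal_mask)
logits = to_vocab(dec_out)

# Reconstruction loss: predict each token of the ORIGINAL sentence (teacher forcing)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, tokenizer.vocab_size),
    tgt["input_ids"][:, 1:].reshape(-1),
)
loss.backward()
```

The key point is `memory=sent_vec`: the decoder only ever sees the pooled sentence vector, so all the information needed for reconstruction has to pass through that fixed-size embedding.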

TSDAE improves embeddings by up to 6.4 points MAP and can achieve up to 93.1% of the performance of in-domain supervised approaches.

They also criticize the STS metric for its lack of correlation with downstream in-domain tasks, which is why they use MAP instead.

They test two setups, unsupervised and supervised:

Unsupervised: they assume they have unlabeled sentences from the target task and tune their approaches on these sentences.

Supervised:

  1. Supervised training on NLI + STS data, then unsupervised training on the target domain.
  2. Unsupervised training on the target domain, then supervised training on NLI + STS (see the sketch below).
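
If we later want to reproduce setup 2 outside finetuner, a sentence-transformers sketch could look roughly like this (placeholder data and hyperparameters; it assumes a sentence-transformers version that ships `DenoisingAutoEncoderDataset`/`DenoisingAutoEncoderLoss`, and uses only NLI for the supervised stage):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses, datasets

# Placeholder data -- swap in real target-domain sentences and NLI triplets
target_sentences = ["an unlabeled sentence from the target domain", "another in-domain sentence"]
nli_triplets = [InputExample(texts=["a premise", "an entailed hypothesis", "a contradicting hypothesis"])]

word_emb = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_emb, pooling])

# Stage 1: unsupervised TSDAE training on target-domain sentences
tsdae_data = datasets.DenoisingAutoEncoderDataset(target_sentences)
tsdae_loader = DataLoader(tsdae_data, batch_size=8, shuffle=True)
tsdae_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
model.fit(train_objectives=[(tsdae_loader, tsdae_loss)], epochs=1, weight_decay=0, scheduler="constantlr")

# Stage 2: supervised training on NLI triplets with in-batch negatives
# (entailment = positive, contradiction = hard negative)
nli_loader = DataLoader(nli_triplets, batch_size=8, shuffle=True)
nli_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1)
```

Setup 1 would simply swap the order of the two stages.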
azayz commented 2 years ago

https://docs.google.com/presentation/d/1GhYggjsMg-7YvEFyD-1a9-jOxSY26IiI/edit?usp=sharing&ouid=116224065734438057464&rtpof=true&sd=true