Self-training Improves Pre-training for Natural Language Understanding

https://arxiv.org/pdf/2010.02194v1.pdf

arXiv 2020, Facebook AI

Proposes SentAugment, which makes use of external unlabeled data.

SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web.

For the retrieval task, a Transformer encoder trained with a triplet loss is used.
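As a rough illustration of what a triplet loss does, here is a minimal sketch in plain Python. The cosine distance and the `margin` value are assumptions for illustration; the paper's exact formulation and hyperparameters may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss (margin is an assumed value).

    The loss is zero once the positive is closer to the anchor
    than the negative by at least `margin`.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss pushes the encoder to place an anchor sentence nearer to its positive (for example, a paraphrase) than to a negative, which is what makes the resulting embeddings usable for nearest-neighbor retrieval.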

The figure above summarizes the whole method.

The data to use is first selected from the unannotated data, and then labeled using a model.
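The select-then-label step can be sketched as follows. This is a toy version under stated assumptions: the bank is a small in-memory list rather than billions of sentences, `retrieve_top_k` and `pseudo_label` are hypothetical helper names, and `teacher` stands for any callable that returns class probabilities for a sentence.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, bank, k):
    """Return the k sentences whose embeddings are closest to the query.

    bank: list of (sentence, embedding) pairs (toy stand-in for the
    large unlabeled sentence bank).
    """
    ranked = sorted(bank, key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [sentence for sentence, _ in ranked[:k]]


def pseudo_label(sentences, teacher):
    """Attach the teacher's predicted class probabilities to each sentence."""
    return [(sentence, teacher(sentence)) for sentence in sentences]
```

At real scale the exhaustive sort would be replaced by an approximate nearest-neighbor index; the sketch only shows the order of operations: retrieve first, then label with the teacher.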

The augmentation data is adjusted so that it follows the class ratio (distribution) of the original training set.
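One way to keep the class ratio is to give each class a quota proportional to its share of the training set and fill it with the most confident pseudo-labeled candidates. The sketch below assumes that selection rule; `select_balanced` and its arguments are hypothetical names, not the authors' code.

```python
from collections import Counter


def select_balanced(candidates, train_labels, total):
    """Pick about `total` candidates while matching the training label ratio.

    candidates: list of (sentence, predicted_label, confidence) tuples.
    train_labels: labels of the original training set.
    """
    label_counts = Counter(train_labels)
    n_train = len(train_labels)
    selected = []
    for label, count in label_counts.items():
        # Quota proportional to the label's share of the training set.
        # round() may make the quotas sum to slightly more or less than total.
        quota = round(total * count / n_train)
        pool = sorted(
            (c for c in candidates if c[1] == label),
            key=lambda c: c[2],
            reverse=True,
        )
        selected.extend(pool[:quota])
    return selected
```

With a training set that is 75% positive and `total=4`, this keeps the three most confident positive candidates and the single most confident negative one.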

The student model does not use the original training data; it is trained by knowledge distillation from the teacher model on the newly generated data.

For knowledge distillation, the student is trained with a KL-divergence (KLD) loss on the teacher's soft labels.
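A minimal sketch of a KL-divergence distillation loss on soft labels, in plain Python. The function names and the `eps` smoothing constant are assumptions for illustration; a real setup would use a deep learning framework and batch the computation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(teacher_logits, student_logits):
    """KL divergence from the teacher's soft labels to the student's prediction."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge.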

Sentence embeddings are computed with SASE, which is proposed in this paper.

Experiments are also run in the few-shot setting.

For domain adaptation, self-training works better than continued pretraining.

Self-training improves performance.

Selecting a larger amount of augmentation data with SentAugment improves performance.

The performance gain differs by retrieval selection strategy, but label-avg gives the largest gain.
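As I understand it, label-avg builds one query embedding per label by averaging the sentence embeddings of the training examples that carry that label. The sketch below assumes that reading; `label_average_queries` is a hypothetical helper name.

```python
from collections import defaultdict


def label_average_queries(examples):
    """Average the embeddings of the training examples per label.

    examples: list of (embedding, label) pairs, all embeddings the same length.
    Returns a dict mapping each label to its mean embedding (the query).
    """
    sums = {}
    counts = defaultdict(int)
    for embedding, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(embedding)
        sums[label] = [s + e for s, e in zip(sums[label], embedding)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in vector]
        for label, vector in sums.items()
    }
```

Each resulting vector is then used as a query against the sentence bank, so a task with two labels issues two queries.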

Performance is best when the external data is 1B in size.

When training with KLD, giving the labels as continuous logits works better than giving them as discrete labels.
