[ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators (2020)]

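Summary and Contributions

1. Training as a discriminator rather than a generator

The network is pre-trained as a discriminator that predicts, for every token, whether it is an original or a replacement.
(vs. MLM, which trains the network as a generator that predicts the original identities of corrupted tokens)
The main advantage of this discriminative task is that the model learns from all input tokens rather than from the small masked-out subset (15% of the input in BERT), which makes it more compute-efficient.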

2. Reminiscent of a GAN, but not adversarial

Because applying GANs to text is difficult, the generator that produces the corrupted tokens is trained with maximum likelihood, so the method is not adversarial.

* ELECTRA : Efficiently Learning an Encoder that Classifies Token Replacements Accurately.

The key idea is to train a text encoder as a discriminator that distinguishes input tokens from plausible, high-quality, challenging negative samples (i.e., tokens that are not the real ones) produced by a small generator network.

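Model performance

(1) The contextual representations outperform those learned by BERT given the same model size, data, and compute.
(2) The gains are especially strong for small models: a model trained on one GPU for 4 days outperforms GPT, which was trained using 30x more compute.
(3) At scale, it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute.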

Motivation

Drawback of existing MLM models: large amounts of compute

Masked language modeling (MLM) pre-training methods such as BERT produce good results when transferred to downstream NLP tasks, but they generally require large amounts of compute to be effective.
As an alternative, we propose a more sample-efficient pre-training task called "replaced token detection".
Instead of masking the input, our approach replaces some tokens with plausible alternatives sampled from a small generator network.
Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.

Previous & Approach

1. Replaced token detection

Recent methods for learning language representations can be viewed as learning denoising autoencoders (Vincent et al., 2008). They select a small subset of the unlabeled input sequence (typically 15%), either mask the identities of those tokens (e.g., BERT; Devlin et al. (2019)) or mask attention to those tokens (e.g., XLNet; Yang et al. (2019)), and then train the network to recover the original input.
These MLM approaches are more effective than conventional language-model pre-training, but they incur a substantial compute cost because the network learns from only 15% of the tokens per example.
The alternative, replaced token detection, corrupts the input not by masking but by replacing some tokens with samples from a proposal distribution, which is typically the output of a small masked language model.
This corruption procedure resolves a mismatch present in BERT (though not in XLNet): the network sees artificial [MASK] tokens during pre-training but not when it is fine-tuned on downstream tasks.
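To make the "15% of tokens versus all tokens" point concrete, here is a minimal sketch that only counts which positions would contribute to the loss under each task. The function name `loss_positions` and the sequence length of 128 are illustrative assumptions, not values from the paper.

```python
import random

def loss_positions(n_tokens, mask_rate=0.15, seed=0):
    """Return the positions that receive a training signal under each task."""
    rng = random.Random(seed)
    k = max(1, round(mask_rate * n_tokens))       # number of masked positions
    masked = sorted(rng.sample(range(n_tokens), k))
    mlm_positions = masked                        # MLM: loss only at masked positions
    rtd_positions = list(range(n_tokens))         # replaced token detection: every position
    return mlm_positions, rtd_positions

mlm, rtd = loss_positions(n_tokens=128)
print(len(mlm), len(rtd))  # 19 128
```

With these assumed settings, each example yields 19 supervised positions for MLM and 128 for replaced token detection.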

2. The difficulty of applying GANs to text

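Applying GANs to text runs into two problems, both of which have also been observed in prior work (Caccia et al., 2018).
1) The adversarially trained generator is less accurate at MLM: it reaches 58% accuracy, compared with 65% for a generator trained with MLE. (We believe this is mainly due to the poor sample efficiency of reinforcement learning in the large action space of text generation.)
2) The adversarially trained generator produces a low-entropy output distribution in which most of the probability mass sits on a single token, so the generator samples have little diversity.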
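As a small illustration of what "low-entropy output distribution" means, the sketch below compares the Shannon entropy of a peaked distribution with that of a uniform one over four candidate tokens. The two distributions are made-up numbers chosen for illustration, not measurements from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical output distributions over four candidate tokens.
peaked = [0.97, 0.01, 0.01, 0.01]   # most of the mass on a single token
spread = [0.25, 0.25, 0.25, 0.25]   # mass spread evenly

print(entropy(peaked) < entropy(spread))  # True
```

Sampling from the peaked distribution returns the same token almost every time, which is why such a generator gives the discriminator little variety to learn from.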

Model

We call our approach ELECTRA, for "Efficiently Learning an Encoder that Classifies Token Replacements Accurately".
As in prior work, we apply it to pre-train Transformer text encoders (Vaswani et al., 2017) that can be fine-tuned on downstream tasks.
(1) Learning from all input positions lets ELECTRA train much faster than BERT. (2) When fully trained, it also achieves higher accuracy on downstream tasks.

Method

The approach trains two neural networks: a generator G and a discriminator D.

(Each one primarily consists of an encoder (e.g., a Transformer network) that maps a sequence of input tokens x = [x1, ..., xn] into a sequence of contextualized vector representations h(x) = [h1, ..., hn].)

(1) For a position _t_, the generator outputs the probability of generating a particular token xt through a softmax layer.
(2) For a position _t_, the discriminator predicts through a sigmoid output layer whether the token xt is "real" (i.e., comes from the data rather than from the generator distribution).
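The two output layers can be written out as follows, reconstructed from the paper's formulation. The symbols e (token embeddings) and w (a learned weight vector) do not appear in these notes and are introduced here for illustration; h_G and h_D denote the encoder outputs h(x) of the generator and the discriminator.

```latex
\begin{align*}
p_G(x_t \mid \boldsymbol{x}) &= \frac{\exp\left(e(x_t)^{\top} h_G(\boldsymbol{x})_t\right)}{\sum_{x'} \exp\left(e(x')^{\top} h_G(\boldsymbol{x})_t\right)} \\
D(\boldsymbol{x}, t) &= \mathrm{sigmoid}\left(w^{\top} h_D(\boldsymbol{x})_t\right)
\end{align*}
```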

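The generator is trained to perform masked language modeling (MLM), as in BERT. The positions _t_ to mask are chosen at random, each drawn from a uniform distribution over the integers from 1 to n.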
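The masking and sampling steps can be sketched as below. This is a toy illustration under stated assumptions: `generator_probs` is a hypothetical stand-in for the generator's softmax output, `toy_generator` ignores context (a real generator would not), positions are 0-indexed rather than 1 to n, and positions are drawn without repetition.

```python
import math
import random

MASK = "[MASK]"

def make_corrupted_input(tokens, generator_probs, mask_rate=0.15, seed=0):
    """Build the masked input and the corrupted input from a token list.

    `generator_probs(masked_tokens, t)` is a hypothetical callable that
    returns a dict {token: probability} for position t.
    """
    rng = random.Random(seed)
    n = len(tokens)
    k = max(1, math.ceil(mask_rate * n))            # how many positions to mask
    positions = sorted(rng.sample(range(n), k))     # uniform over positions, no repeats
    masked = [MASK if t in positions else tok for t, tok in enumerate(tokens)]
    corrupted = list(tokens)
    for t in positions:
        dist = generator_probs(masked, t)
        candidates, weights = zip(*dist.items())
        # Sample a replacement token from the generator's distribution.
        corrupted[t] = rng.choices(candidates, weights=weights, k=1)[0]
    return positions, masked, corrupted

def toy_generator(masked_tokens, t):
    # Toy distribution for illustration only; it does not look at the context.
    return {"the": 0.5, "a": 0.3, "chef": 0.2}

tokens = ["the", "chef", "cooked", "the", "meal"]
positions, masked, corrupted = make_corrupted_input(tokens, toy_generator)
```

Only the sampled positions can differ between `tokens` and `corrupted`; every other token is copied through unchanged.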

There are also two loss functions
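The two losses, reconstructed from the paper's formulation, are shown below. Here m is the set of masked positions, x^masked is the input with those positions replaced by [MASK], and x^corrupt is the input with those positions replaced by generator samples; this notation is an assumption carried over from the paper rather than from these notes.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{MLM}}(\boldsymbol{x}, \theta_G) &= \mathbb{E}\left( \sum_{i \in \boldsymbol{m}} -\log p_G\left(x_i \mid \boldsymbol{x}^{\mathrm{masked}}\right) \right) \\
\mathcal{L}_{\mathrm{Disc}}(\boldsymbol{x}, \theta_D) &= \mathbb{E}\left( \sum_{t=1}^{n} -\mathbb{1}\left(x^{\mathrm{corrupt}}_t = x_t\right) \log D\left(\boldsymbol{x}^{\mathrm{corrupt}}, t\right) - \mathbb{1}\left(x^{\mathrm{corrupt}}_t \neq x_t\right) \log\left(1 - D\left(\boldsymbol{x}^{\mathrm{corrupt}}, t\right)\right) \right)
\end{align*}
```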

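** The training objective resembles that of a GAN, but there are key differences.

(1) If the generator happens to generate the correct token, that token is considered "real" rather than "fake". This moderately improved results on downstream tasks.
(2) The generator is trained with maximum likelihood rather than adversarially to fool the discriminator. Adversarial training is difficult because it is impossible to backpropagate through sampling from the generator. (Training the generator with reinforcement learning to get around this was also tried, but it performed worse than maximum likelihood.)
(3) The generator is not given a noise vector as input, as is typical in a GAN.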
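Difference (1) can be sketched as a labeling rule: a token is labeled by comparing it with the original, not by whether its position was sampled. The helper name `discriminator_labels` and the example sentence are illustrative assumptions.

```python
def discriminator_labels(original, corrupted):
    """Label each token: 1 = 'real' (matches the original), 0 = 'fake' (replaced)."""
    return [int(o == c) for o, c in zip(original, corrupted)]

original = ["the", "chef", "cooked", "the", "meal"]
# Suppose positions 0 and 2 were sampled from the generator:
# position 0 came back as "the" (correct), position 2 came back as "ate".
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = discriminator_labels(original, corrupted)
print(labels)  # [1, 1, 0, 1, 1]
```

Position 0 is labeled "real" even though the generator produced it, because the sampled token equals the original.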

Combined Loss

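The combined loss is minimized over a large corpus _X_ of raw text. (The expectations in the losses are approximated with a single sample, and the discriminator loss is not backpropagated through the generator, which the sampling step makes impossible.)
After pre-training, the generator is thrown away and the discriminator is fine-tuned on downstream tasks!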
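A minimal sketch of the combined objective for a single example, assuming it takes the form L_MLM + λ·L_Disc. The weight name `lam` and its default of 50.0 are assumptions for illustration, and the inputs are plain probability lists rather than network outputs.

```python
import math

def mlm_loss(token_probs):
    """Negative log-likelihood of the original tokens at the masked positions.

    `token_probs[i]` is the generator's probability of the original token
    at the i-th masked position.
    """
    return -sum(math.log(p) for p in token_probs)

def disc_loss(real_probs, labels):
    """Binary cross-entropy summed over every input position.

    `real_probs[t]` is the discriminator's probability that token t is
    original; `labels[t]` is 1 for original tokens and 0 for replaced ones.
    """
    return -sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for p, y in zip(real_probs, labels)
    )

def combined_loss(token_probs, real_probs, labels, lam=50.0):
    # lam weights the discriminator term; 50.0 is an illustrative default.
    return mlm_loss(token_probs) + lam * disc_loss(real_probs, labels)

loss = combined_loss(token_probs=[0.5], real_probs=[0.9, 0.2], labels=[1, 0])
```

In a real training loop the two terms would update separate parameter sets, since the discriminator loss does not flow back into the generator.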

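Experimental results

Recent pre-training methods require large amounts of compute to be effective, which raises concerns about their cost and accessibility.
Because more compute in pre-training almost always yields better downstream accuracy, we treat (1) compute efficiency and (2) absolute downstream performance as the main considerations.
Experiments are run on GLUE (a natural language understanding benchmark) and SQuAD (a question answering benchmark).
ELECTRA substantially outperforms MLM-based methods such as BERT and XLNet given the same model size, data, and compute.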

ELECTRA-Small results

(1) Can be trained on a single GPU in 4 days.
(2) On GLUE, scores 5 points higher than a comparable small BERT model and even outperforms the much larger GPT model.

ELECTRA-Large results

(1) Performs comparably to RoBERTa despite having fewer parameters and using 1/4 of the compute.
(2) Outperforms ALBERT on GLUE (NLU benchmark) and SQuAD (question answering benchmark).

Conclusion

In language representation learning, the discriminative task of distinguishing real data from challenging negative samples is (1) more compute-efficient and (2) more parameter-efficient than existing generative approaches.

Terms

corrupt the input (e.g., BERT corrupts the input by replacing some tokens with [MASK] and then trains a model to reconstruct the original tokens)
Most current pre-training methods require large amounts of compute to be effective, raising concerns about their cost and accessibility.
