modulabs / beyondBERT

A repository that collects the discussion notes of the beyondBERT (cohort 11.5) study group.

TinyBERT: Distilling BERT for Natural Language Understanding #22

Closed. seopbo closed this issue 3 years ago.

seopbo commented 4 years ago

What is this paper about? 👋

A knowledge distillation (KD) method for Transformer-based models, in particular pre-trained language models (PLMs).

Abstract (Summary) 🕵🏻‍♂️

Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, the pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to effectively execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we firstly propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large “teacher” BERT can be well transferred to a small “student” TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.

TinyBERT is empirically effective and achieves more than 96% the performance of teacher BERTBASE on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT is also significantly better than state-of-the-art baselines on BERT distillation, with only ∼28% parameters and ∼31% inference time of them.

What can we learn from reading this paper? 🤔

📎 Limitations of PLMs

Difficulty of deploying them on edge devices.

Redundancy problems in PLMs

redundancy in attention

(screenshot: figure showing redundancy in attention)

redundancy in layer

(screenshot: figure showing redundancy across layers)

The model size can be reduced substantially while maintaining or even improving PLM performance.

📎 Model Compression Techniques

While other model compression techniques mainly shrink the model while trying to minimize the resulting performance drop, KD (Knowledge Distillation) trains the student by explicitly transferring the feature representations learned by the teacher network (TN) to the student network (SN), which I think makes it a more stable approach.
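For intuition, here is a minimal, hedged PyTorch sketch (not from the paper or this repo) of a generic KD objective that transfers both the TN's softened predictions and one of its feature representations to the SN; the temperature and the 312/768 dimensions are illustrative assumptions.

```python
# Minimal KD sketch (illustrative only): the student mimics the teacher's
# softened predictions and one of its internal feature representations.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEMPERATURE = 2.0  # assumed softening temperature

def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat, proj):
    # Soft cross-entropy between softened teacher and student distributions.
    soft_targets = F.softmax(teacher_logits / TEMPERATURE, dim=-1)
    log_probs = F.log_softmax(student_logits / TEMPERATURE, dim=-1)
    kd_loss = -(soft_targets * log_probs).sum(dim=-1).mean()
    # Feature matching: project the narrower student features up to the
    # teacher's width, then penalize the mean squared error.
    feat_loss = F.mse_loss(proj(student_feat), teacher_feat)
    return kd_loss + feat_loss

# Toy usage with random tensors standing in for real model outputs.
proj = nn.Linear(312, 768)  # student dim -> teacher dim (illustrative sizes)
student_logits, teacher_logits = torch.randn(8, 2), torch.randn(8, 2)
student_feat, teacher_feat = torch.randn(8, 312), torch.randn(8, 768)
print(distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat, proj))
```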

KD (Knowledge Distillation) for BERT

(screenshot: overview figure of KD approaches for BERT)

Main Contributions
1. We propose a new Transformer distillation method to encourage that the linguistic knowledge encoded in teacher BERT can be well transferred to TinyBERT.
2. We propose a novel two-stage learning framework with performing the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT.
3. We show experimentally that our TinyBERT can achieve more than 96% the performance of teacher BERTBASE on GLUE tasks, while having much fewer parameters (∼13.3%) and less inference time (∼10.6%), and significantly outperforms other state-of-the-art baselines on BERT distillation.

📎 TinyBERT's Model Compression Method

1) Transformer Distillation

Proposes a distillation method tailored to Transformer-based models.

(screenshot: overview of the proposed Transformer distillation)

TinyBERT defines the SN's loss function as follows.

(screenshots: overall distillation objective and layer-wise loss definition)
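Roughly, in the paper's formulation the student is trained with a sum of layer-wise losses under a layer mapping $n = g(m)$, where $m = 0$ is the embedding layer, $m = M + 1$ is the prediction layer, and $\lambda_m$ weights the importance of each layer:

$$\mathcal{L}_{\text{model}} = \sum_{x \in \mathcal{X}} \sum_{m=0}^{M+1} \lambda_m\, \mathcal{L}_{\text{layer}}\big(f_m^S(x),\, f_{g(m)}^T(x)\big)$$

$$\mathcal{L}_{\text{layer}} = \begin{cases} \mathcal{L}_{\text{embd}}, & m = 0 \\ \mathcal{L}_{\text{hidn}} + \mathcal{L}_{\text{attn}}, & M \ge m > 0 \\ \mathcal{L}_{\text{pred}}, & m = M + 1 \end{cases}$$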

Transformer-layer Distillation

(screenshot: Transformer-layer distillation diagram)
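To make the mechanics concrete, here is a hedged PyTorch sketch (not the authors' code) of the attention and hidden-state matching that the two objectives below formalize; the TinyBERT_4/BERT-base dimensions, the uniform mapping g(m) = 3m, and the toy random tensors are illustrative assumptions.

```python
# Hedged sketch of Transformer-layer distillation: the m-th student layer mimics
# the attention matrices and hidden states of the g(m)-th teacher layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

M, N = 4, 12             # number of student / teacher Transformer layers
d_s, d_t = 312, 768      # student / teacher hidden sizes
seq_len, heads = 16, 12  # toy sequence length, attention heads (shared by both)

def g(m):
    """Uniform layer mapping: student layer m learns from teacher layer 3m."""
    return m * N // M

W_h = nn.Linear(d_s, d_t, bias=False)  # learnable projection for hidden states

def transformer_layer_loss(s_attn, t_attn, s_hidn, t_hidn):
    """*_attn[k]: (heads, seq, seq) attention scores of layer k+1;
    *_hidn[k]: (seq, d) hidden states of layer k+1."""
    loss = torch.tensor(0.0)
    for m in range(1, M + 1):
        n = g(m)
        loss = loss + F.mse_loss(s_attn[m - 1], t_attn[n - 1])       # attention matching
        loss = loss + F.mse_loss(W_h(s_hidn[m - 1]), t_hidn[n - 1])  # hidden-state matching
    return loss

# Toy tensors standing in for real attention scores and hidden states.
s_attn = [torch.randn(heads, seq_len, seq_len) for _ in range(M)]
t_attn = [torch.randn(heads, seq_len, seq_len) for _ in range(N)]
s_hidn = [torch.randn(seq_len, d_s) for _ in range(M)]
t_hidn = [torch.randn(seq_len, d_t) for _ in range(N)]
print(transformer_layer_loss(s_attn, t_attn, s_hidn, t_hidn))
```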

attention based distillation

(screenshot: attention-based distillation objective)
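Roughly, the attention-based objective averages an MSE between the student's and the mapped teacher layer's attention matrices over the $h$ heads; the paper notes that using the unnormalized attention scores (before the softmax) converges faster:

$$\mathcal{L}_{\text{attn}} = \frac{1}{h} \sum_{i=1}^{h} \text{MSE}\big(\mathbf{A}_i^S, \mathbf{A}_i^T\big)$$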

hidden states based distillation

(screenshot: hidden-states-based distillation objective)
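The hidden-state objective matches the student's hidden states to the teacher's after a learnable projection $\mathbf{W}_h$ that bridges the different hidden sizes:

$$\mathcal{L}_{\text{hidn}} = \text{MSE}\big(\mathbf{H}^S \mathbf{W}_h, \mathbf{H}^T\big)$$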

embedding-layer distillation

(screenshot: embedding-layer distillation objective)
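The embedding-layer objective has the same form, with its own projection matrix $\mathbf{W}_e$:

$$\mathcal{L}_{\text{embd}} = \text{MSE}\big(\mathbf{E}^S \mathbf{W}_e, \mathbf{E}^T\big)$$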

prediction-layer distillation

(screenshot: prediction-layer distillation objective)
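The prediction-layer objective is the usual soft cross-entropy between teacher and student logits softened by a temperature $t$ (the paper reports that $t = 1$ works well):

$$\mathcal{L}_{\text{pred}} = \text{CE}\big(\mathbf{z}^T / t,\ \mathbf{z}^S / t\big)$$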

2) TinyBERT Learning

(screenshot: TinyBERT's two-stage learning framework)

General Distillation: the original BERT (without fine-tuning) acts as the teacher, and the student learns the intermediate-layer (embedding, attention, hidden-state) behavior on a large general-domain corpus.

Task-Specific Distillation: a BERT fine-tuned on the downstream task acts as the teacher; intermediate-layer distillation and then prediction-layer distillation are performed on an augmented task-specific dataset.

Effect of two-stage learning: the ablation suggests both stages matter; dropping either general distillation or task-specific distillation hurts downstream performance.

📎 Experiments

Experimental Results on GLUE

(screenshots: GLUE benchmark result tables)

Effects of Model Size

(screenshot: effect of student model size on performance)

Ablation Study

(screenshot: ablation study results)

📎 Conclusions and Future Work

📎 Impressions

Are there any articles or issues worth reading alongside this paper?

Please share the reference URLs! 🔗