modulabs / beyondBERT

A repository that collects the discussion notes of the beyondBERT (cohort 11.5) study group.

TinyBERT: Distilling BERT for Natural Language Understanding #22

Closed. seopbo closed this issue 3 years ago.

seopbo commented 4 years ago

What is this paper about? 👋

A knowledge distillation (KD) method for Transformer-based models, in particular pre-trained language models (PLMs).

Abstract (Summary) 🕵🏻‍♂️

Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, the pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to effectively execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we firstly propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large “teacher” BERT can be well transferred to a small “student” TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.

TinyBERT is empirically effective and achieves more than 96% the performance of teacher BERTBASE on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT is also significantly better than state-of-the-art baselines on BERT distillation, with only ∼28% parameters and ∼31% inference time of them.

What can we learn from reading this paper? 🤔

📎 Limitations of PLMs

Difficulty of deploying them on edge devices.

Redundancy problems in PLMs

redundancy in attention

(screenshot: figure showing redundancy in attention)

redundancy in layer

(screenshot: figure showing redundancy across layers)

The model size can be reduced substantially while maintaining or even improving PLM performance.

📎 Model Compression Techniques

While other model compression techniques mainly shrink the model while trying to minimize the resulting performance drop, KD (Knowledge Distillation) trains the student by explicitly transferring the feature representations learned by the teacher network (TN) to the student network (SN), which I think makes it a more stable approach.
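For intuition, here is a minimal, hedged PyTorch sketch (not from the paper or this repo) of a generic KD objective that transfers both the TN's softened predictions and one of its feature representations to the SN; the temperature and the 312/768 dimensions are illustrative assumptions.

```python
# Minimal KD sketch (illustrative only): the student mimics the teacher's
# softened predictions and one of its internal feature representations.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEMPERATURE = 2.0  # assumed softening temperature

def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat, proj):
    # Soft cross-entropy between softened teacher and student distributions.
    soft_targets = F.softmax(teacher_logits / TEMPERATURE, dim=-1)
    log_probs = F.log_softmax(student_logits / TEMPERATURE, dim=-1)
    kd_loss = -(soft_targets * log_probs).sum(dim=-1).mean()
    # Feature matching: project the narrower student features up to the
    # teacher's width, then penalize the mean squared error.
    feat_loss = F.mse_loss(proj(student_feat), teacher_feat)
    return kd_loss + feat_loss

# Toy usage with random tensors standing in for real model outputs.
proj = nn.Linear(312, 768)  # student dim -> teacher dim (illustrative sizes)
student_logits, teacher_logits = torch.randn(8, 2), torch.randn(8, 2)
student_feat, teacher_feat = torch.randn(8, 312), torch.randn(8, 768)
print(distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat, proj))
```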

KD (Knowledge Distillation) for BERT

(screenshot: overview figure of KD approaches for BERT)

Main Contributions
1. We propose a new Transformer distillation method to encourage that the linguistic knowledge encoded in teacher BERT can be well transferred to TinyBERT.
2. We propose a novel two-stage learning framework with performing the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can capture both the general-domain and task-specific knowledge of the teacher BERT.
3. We show experimentally that our TinyBERT can achieve more than 96% the performance of teacher BERTBASE on GLUE tasks, while having much fewer parameters (∼13.3%) and less inference time (∼10.6%), and significantly outperforms other state-of-the-art baselines on BERT distillation.

📎 TinyBERT's Model Compression Method

1) Transformer Distillation

Proposes a distillation method tailored to Transformer-based models.

(screenshot: overview of the proposed Transformer distillation)

TinyBERT defines the SN's loss function as follows.

(screenshots: overall distillation objective and layer-wise loss definition)
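Roughly, in the paper's formulation the student is trained with a sum of layer-wise losses under a layer mapping $n = g(m)$, where $m = 0$ is the embedding layer, $m = M + 1$ is the prediction layer, and $\lambda_m$ weights the importance of each layer:

$$\mathcal{L}_{\text{model}} = \sum_{x \in \mathcal{X}} \sum_{m=0}^{M+1} \lambda_m\, \mathcal{L}_{\text{layer}}\big(f_m^S(x),\, f_{g(m)}^T(x)\big)$$

$$\mathcal{L}_{\text{layer}} = \begin{cases} \mathcal{L}_{\text{embd}}, & m = 0 \\ \mathcal{L}_{\text{hidn}} + \mathcal{L}_{\text{attn}}, & M \ge m > 0 \\ \mathcal{L}_{\text{pred}}, & m = M + 1 \end{cases}$$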

Transformer-layer Distillation

(screenshot: Transformer-layer distillation diagram)
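To make the mechanics concrete, here is a hedged PyTorch sketch (not the authors' code) of the attention and hidden-state matching that the two objectives below formalize; the TinyBERT_4/BERT-base dimensions, the uniform mapping g(m) = 3m, and the toy random tensors are illustrative assumptions.

```python
# Hedged sketch of Transformer-layer distillation: the m-th student layer mimics
# the attention matrices and hidden states of the g(m)-th teacher layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

M, N = 4, 12             # number of student / teacher Transformer layers
d_s, d_t = 312, 768      # student / teacher hidden sizes
seq_len, heads = 16, 12  # toy sequence length, attention heads (shared by both)

def g(m):
    """Uniform layer mapping: student layer m learns from teacher layer 3m."""
    return m * N // M

W_h = nn.Linear(d_s, d_t, bias=False)  # learnable projection for hidden states

def transformer_layer_loss(s_attn, t_attn, s_hidn, t_hidn):
    """*_attn[k]: (heads, seq, seq) attention scores of layer k+1;
    *_hidn[k]: (seq, d) hidden states of layer k+1."""
    loss = torch.tensor(0.0)
    for m in range(1, M + 1):
        n = g(m)
        loss = loss + F.mse_loss(s_attn[m - 1], t_attn[n - 1])       # attention matching
        loss = loss + F.mse_loss(W_h(s_hidn[m - 1]), t_hidn[n - 1])  # hidden-state matching
    return loss

# Toy tensors standing in for real attention scores and hidden states.
s_attn = [torch.randn(heads, seq_len, seq_len) for _ in range(M)]
t_attn = [torch.randn(heads, seq_len, seq_len) for _ in range(N)]
s_hidn = [torch.randn(seq_len, d_s) for _ in range(M)]
t_hidn = [torch.randn(seq_len, d_t) for _ in range(N)]
print(transformer_layer_loss(s_attn, t_attn, s_hidn, t_hidn))
```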

attention based distillation

(screenshot: attention-based distillation objective)
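Roughly, the attention-based objective averages an MSE between the student's and the mapped teacher layer's attention matrices over the $h$ heads; the paper notes that using the unnormalized attention scores (before the softmax) converges faster:

$$\mathcal{L}_{\text{attn}} = \frac{1}{h} \sum_{i=1}^{h} \text{MSE}\big(\mathbf{A}_i^S, \mathbf{A}_i^T\big)$$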

hidden states based distillation

(screenshot: hidden-states-based distillation objective)
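The hidden-state objective matches the student's hidden states to the teacher's after a learnable projection $\mathbf{W}_h$ that bridges the different hidden sizes:

$$\mathcal{L}_{\text{hidn}} = \text{MSE}\big(\mathbf{H}^S \mathbf{W}_h, \mathbf{H}^T\big)$$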

embedding-layer distillation

(screenshot: embedding-layer distillation objective)
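The embedding-layer objective has the same form, with its own projection matrix $\mathbf{W}_e$:

$$\mathcal{L}_{\text{embd}} = \text{MSE}\big(\mathbf{E}^S \mathbf{W}_e, \mathbf{E}^T\big)$$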

prediction-layer distillation

(screenshot: prediction-layer distillation objective)
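The prediction-layer objective is the usual soft cross-entropy between teacher and student logits softened by a temperature $t$ (the paper reports that $t = 1$ works well):

$$\mathcal{L}_{\text{pred}} = \text{CE}\big(\mathbf{z}^T / t,\ \mathbf{z}^S / t\big)$$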

2) TinyBERT Learning

(screenshot: TinyBERT's two-stage learning framework)

General Distillation: the original BERT (without fine-tuning) acts as the teacher, and the student learns the intermediate-layer (embedding, attention, hidden-state) behavior on a large general-domain corpus.

Task-Specific Distillation: a BERT fine-tuned on the downstream task acts as the teacher; intermediate-layer distillation and then prediction-layer distillation are performed on an augmented task-specific dataset.

Effect of two-stage learning: the ablation suggests both stages matter; dropping either general distillation or task-specific distillation hurts downstream performance.

📎 Experiments

Experimental Results on GLUE

(screenshots: GLUE benchmark result tables)

Effects of Model Size

(screenshot: effect of student model size on performance)

Ablation Study

(screenshot: ablation study results)

📎 Conclusions and Future Work

📎 Impressions

Are there any articles or issues worth reading alongside this paper?

Please share the reference URLs! 🔗