FastBERT: a Self-distilling BERT with Adaptive Inference Time

What is this paper about? 👋

Briefly describe what the paper is about! (Even 1-2 short lines is fine!)

Using distillation and adaptive inference, the authors build a BERT whose inference speed can be controlled with only a small loss in accuracy.

Abstract (Summary) 🕵🏻‍♂️

Please paste the paper's original abstract!

Pre-trained language models like BERT have proven to be highly performant. However, they are often computationally expensive in many practical scenarios, for such heavy models can hardly be readily implemented with limited resources. To improve their efficiency with an assured model performance, we propose a novel speed-tunable FastBERT with adaptive inference time. The speed at inference can be flexibly adjusted under varying demands, while redundant calculation of samples is avoided. Moreover, this model adopts a unique self-distillation mechanism at fine-tuning, further enabling a greater computational efficacy with minimal loss in performance. Our model achieves promising results in twelve English and Chinese datasets. It is able to speed up by a wide range from 1 to 12 times than BERT if given different speedup thresholds to make a speed-performance trade-off.

Tell us what can be learned from reading this paper! 🤔

What knowledge can you gain from reading this paper properly?

1. Introduction & Related Works

Over the past two years, NLP has seen strong performance gains from pre-training on unlabeled corpora and then fine-tuning on labeled data.
Despite the large accuracy gains, these models demand substantial computation and have relatively slow inference, which limits their practicality.
Various methods have been proposed to improve usability.
- Quantization: Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. For tensors, quantization usually means converting 32-bit float values to 8-bit int values.
- Pruning: removing nodes that have little effect on performance
- Knowledge Distillation: using a large Teacher Model to train a small Student Model well
- Recommended reading: https://jeongukjae.github.io/posts/the-future-of-nlp/
Knowledge distillation: PKD-BERT, TinyBERT, DistilBERT
Adaptive inference:

3. Methodology

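Architecture

The model consists of a Backbone and Branches.
- Backbone: Embedding, Transformer Encoder Blocks, Teacher-Classifier (attached to the last Encoder Block)
- Branches: Student-Classifiers (one attached to each of the other Encoder Blocks, L-1 in total)
This design is meant to balance model accuracy against inference speed.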

Training procedure

The backbone and the student-classifiers require separate training steps. While one module is being trained, the parameters of the other must stay frozen.
Training for downstream inference proceeds in three steps.
1. the major backbone pre-training: no different from ordinary BERT pre-training, so an existing pre-trained model can be used as is (which means BERT-like models such as RoBERTa and ERNIE can be used as well).
2. entire back-bone fine-tuning: for each downstream task, the major backbone and the teacher-classifier are fine-tuned.
3. self-distillation for student-classifiers: the students are mutually independent; each one is trained to minimize the KL-Divergence between its own predicted distribution and the teacher's predicted distribution.
  - Because this step needs only the teacher-classifier's output, labeled data is not strictly required. That seems to be why it is called "self-distillation".
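The distillation objective in step 3 can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names, the `eps` smoothing term, and the toy distributions are assumptions, and the direction KL(student || teacher) is assumed as well.

```python
import math


def kl_divergence(p_s, p_t, eps=1e-12):
    """KL(p_s || p_t) for two discrete distributions given as lists.

    eps is a hypothetical smoothing term that avoids log(0) and division by zero.
    """
    return sum(
        ps * math.log((ps + eps) / (pt + eps))
        for ps, pt in zip(p_s, p_t)
        if ps > 0.0
    )


def self_distillation_loss(student_dists, teacher_dist):
    """Sum the KL terms of all student-classifiers for a single sample."""
    return sum(kl_divergence(p_s, teacher_dist) for p_s in student_dists)


# Toy example: one teacher distribution and three student distributions.
teacher = [0.9, 0.1]
students = [[0.6, 0.4], [0.8, 0.2], [0.9, 0.1]]
loss = self_distillation_loss(students, teacher)
print(f"self-distillation loss: {loss:.4f}")
```

No gold label appears anywhere in the computation, which matches the note above that unlabeled data is enough for this step.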

Inference stage (Adaptive inference)

For an input sequence, the uncertainty of a student-classifier's output can be measured with normalized entropy. $p_s$ is the output probability distribution and $N$ is the number of classes.
$\text{Uncertainty}=\dfrac{\sum_{i=1}^N p_s(i) \log p_s(i)}{\log \frac{1}{N}} \quad \cdots (7)$
The method rests on one hypothesis and one definition concerning uncertainty.
- Hypothesis 1. LUHA: the Lower the Uncertainty, the Higher the Accuracy
- Definition 1. Speed: The threshold to distinguish high and low uncertainty
The LUHA hypothesis is verified in Section 4.4.
Both Uncertainty and Speed take values between 0 and 1.
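Equation (7) can be written as a few lines of Python. This is a minimal sketch: the function name is an assumption, and terms with zero probability are skipped using the usual convention that 0 log 0 = 0.

```python
import math


def uncertainty(p_s):
    """Normalized entropy of a probability distribution, as in Eq. (7)."""
    n = len(p_s)
    if n < 2:
        raise ValueError("need at least two classes")
    # sum_i p_s(i) * log p_s(i); zero-probability terms contribute nothing
    entropy_term = sum(p * math.log(p) for p in p_s if p > 0.0)
    return entropy_term / math.log(1.0 / n)


uniform = uncertainty([0.5, 0.5])      # maximally uncertain prediction
confident = uncertainty([0.97, 0.03])  # confident prediction
print(f"uniform: {uniform:.3f}, confident: {confident:.3f}")
```

A uniform distribution gives the maximum value 1 and a one-hot distribution gives 0, which is why Uncertainty can be compared directly with a Speed threshold in the same range.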
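The adaptive inference mechanism can be described as follows.
- At each layer of FastBERT, the corresponding student-classifier predicts the label of each sample and measures the Uncertainty of that prediction.
- Samples whose Uncertainty is lower than Speed exit early with that prediction; the rest are passed on to the next layer.
Intuitively, a higher Speed means fewer samples are sent to deeper layers, and inference becomes faster as a result.

For example, with Speed = 0.5, a sample moves on to the next layer only when its Uncertainty is 0.5 or higher; a sample with Uncertainty below 0.5 exits at the current layer.

Fast inference case: "This book is really good!"
Slow inference case: "Excellent! but a bit difficult to understand"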
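The early-exit rule can be sketched as a loop over layers. This is only an illustration: `adaptive_inference` and the per-layer distributions are hypothetical, and the distributions are precomputed here for simplicity, whereas a real model would stop running encoder blocks after the exit.

```python
import math


def uncertainty(p_s):
    """Normalized entropy of a probability distribution (Eq. 7)."""
    entropy_term = sum(p * math.log(p) for p in p_s if p > 0.0)
    return entropy_term / math.log(1.0 / len(p_s))


def adaptive_inference(layer_dists, speed):
    """Return (predicted_label, exit_layer) for one sample.

    layer_dists holds one predicted distribution per layer: the
    student-classifiers first and the teacher-classifier last.
    """
    last = len(layer_dists) - 1
    for layer, dist in enumerate(layer_dists):
        # Exit when the prediction is certain enough, or at the final layer.
        if uncertainty(dist) < speed or layer == last:
            label = max(range(len(dist)), key=dist.__getitem__)
            return label, layer


# Hypothetical per-layer outputs for an easy and a hard sample.
easy = [[0.97, 0.03], [0.98, 0.02], [0.99, 0.01], [0.99, 0.01]]
hard = [[0.55, 0.45], [0.70, 0.30], [0.95, 0.05], [0.99, 0.01]]

print(adaptive_inference(easy, speed=0.5))  # exits at layer 0
print(adaptive_inference(hard, speed=0.5))  # exits at layer 2
print(adaptive_inference(hard, speed=0.9))  # higher Speed, exits at layer 1
```

Raising `speed` moves the exit of the hard sample to a shallower layer, which is the speed-accuracy knob the paper describes.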

4. Experimental result

4.1 FLOPs analysis

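Floating-point operations (FLOPs) are a measure of a model's computational complexity: the number of floating-point operations the model performs in a single process. The figure is independent of the environment (CPU, GPU, TPU) and reflects computational complexity only. In general, the higher a model's FLOPs, the longer inference takes, so at equal accuracy the model with lower FLOPs is more efficient.
Table 1 lists the FLOPs of the two structures and shows that the Classifier is much lighter than the Transformer.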
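A rough back-of-the-envelope count shows why a classifier branch is cheap compared with a Transformer block. This is a sketch under stated assumptions, not a reproduction of Table 1: each multiply-accumulate is counted as 2 FLOPs, biases and activations are ignored, the feed-forward width 3072 is the usual BERT-base value, and the 128-dimensional classifier projection is a hypothetical size chosen for illustration.

```python
def dense_flops(seq_len, d_in, d_out):
    """FLOPs of a dense layer applied to every token (2 FLOPs per multiply-accumulate)."""
    return 2 * seq_len * d_in * d_out


SEQ_LEN = 128     # max length used in the experiments
HIDDEN = 768      # hidden dimension used in the experiments
FFN = 3072        # assumed BERT-base feed-forward width
CLS_HIDDEN = 128  # hypothetical classifier projection size

# Feed-forward sub-layer of one Transformer block: 768 -> 3072 -> 768
feed_forward = dense_flops(SEQ_LEN, HIDDEN, FFN) + dense_flops(SEQ_LEN, FFN, HIDDEN)
# First projection of a classifier branch: 768 -> 128
classifier_projection = dense_flops(SEQ_LEN, HIDDEN, CLS_HIDDEN)

print(f"feed-forward:          {feed_forward / 1e6:.1f}M FLOPs")
print(f"classifier projection: {classifier_projection / 1e6:.1f}M FLOPs")
print(f"ratio:                 {feed_forward / classifier_projection:.0f}x")
```

Under these assumptions the feed-forward sub-layer alone costs dozens of times more than the classifier projection, so evaluating a branch at every layer adds little overhead.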

4.2 Baseline and dataset

Baseline
- BERT: Devlin et al., 2019, the standard BERT
- DistilBERT: Sanh et al., 2019, the best-known distillation method
Dataset
- Six Chinese and six English datasets are used. All but one are sentence classification tasks.
- Chinese: ChnSentiCorp, Book review, Shopping review, Weibo, THUCNews, and LCQMC (sentence matching task)
- English: Ag.News, Amz.F, DBpedia, Yahoo, Yelp.F, Yelp.P
  - Yelp.P (Yelp reviews): predicting a polarity label by considering stars 1~2 negative, 3~4 positive
  - Yelp.F (Yelp reviews): predicting full number of stars the user has given
  - Yahoo (Yah.A): Yahoo! Answers dataset
  - Amz.F: Amazon reviews (full score prediction)

4.3 Performance comparison

Number of Layer Blocks: 12
Number of Self-attention heads: 12
Hidden dimension: 768
Max Length: 128
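Both FastBERT and BERT start from Google's pre-trained model, and DistilBERT uses the model from its paper.
Fine-tuning uses AdamW with a learning rate of $2 \times 10^{-5}$ and a warm-up of 0.1.
The model with the highest accuracy within 3 epochs is selected. For FastBERT's self-distillation, the learning rate is raised to $2 \times 10^{-4}$ and training runs for 5 epochs.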

table2

As Table 2 shows, FastBERT generally achieves faster inference while keeping the accuracy loss small.

fig3

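Figure 3 shows, for each dataset, Speed vs. accuracy, Speed vs. Speedup (how many times faster inference becomes), and Speedup vs. accuracy.
The dataset whose accuracy drops most sharply as Speed increases is the Chinese sentence matching task dataset (LCQMC).
For (b) Book Review and Shopping Review and (e) Yelp.F, Amz.F, and Yahoo, Speed has to be raised considerably before inference gets faster. I read this as a sign that the review datasets are harder tasks (their accuracy is also lower than on the other tasks), although the drop in accuracy does not look that large.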

4.4 LUHA hypothesis verification

The Book Review dataset is used to verify LUHA: the Lower the Uncertainty, the Higher the Accuracy.

0   我是网络乞丐，请多多支持！
1   上次没看完是觉得闷。这一次一口气看完了，没觉得闷。且好看。
1   难以想象的探险家精神

fig4

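Figure 4 shows the accuracy of Student-Classifier 0, Student-Classifier 5, and the Teacher-Classifier, computed per Uncertainty interval.
- Method: store the uncertainty and predicted label produced by each classifier, then count the correct predictions in each Uncertainty interval and compute the accuracy.
At first glance the Student-Classifiers may seem to be more accurate, but Figure 6 (a) shows otherwise: in the lower layers the Uncertainty distribution is close to uniform, so overall accuracy is still higher for the Teacher.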
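The counting procedure described above can be sketched as follows. The function name, the record layout `(uncertainty, predicted_label, gold_label)`, the choice of ten equal-width intervals, and the toy records are all assumptions made for illustration.

```python
def accuracy_by_uncertainty(records, n_bins=10):
    """Group predictions into equal-width Uncertainty intervals.

    records: iterable of (uncertainty, predicted_label, gold_label).
    Returns one (low, high, count, accuracy) tuple per interval;
    accuracy is None for an empty interval.
    """
    total = [0] * n_bins
    correct = [0] * n_bins
    for u, pred, gold in records:
        # Clamp so that an uncertainty of exactly 1.0 lands in the last interval.
        idx = min(int(u * n_bins), n_bins - 1)
        total[idx] += 1
        correct[idx] += int(pred == gold)
    return [
        (i / n_bins, (i + 1) / n_bins, total[i],
         correct[i] / total[i] if total[i] else None)
        for i in range(n_bins)
    ]


# Toy records for one classifier (not real experimental data).
records = [(0.05, 1, 1), (0.08, 0, 0), (0.95, 1, 0), (0.92, 0, 0), (1.0, 1, 0)]
rows = accuracy_by_uncertainty(records)
for low, high, count, acc in rows:
    if count:
        print(f"[{low:.1f}, {high:.1f}): n={count}, accuracy={acc:.2f}")
```

Reporting the count next to the accuracy matters here: an interval-level accuracy says nothing about how many samples fall into that interval, which is the point made about Figure 6 (a).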

4.5 In-depth study

the distribution of exit layer
- Figure 5: as the inference Speed is raised (so that a decision is made even when uncertainty is relatively high), most samples reach a conclusion at the first layer.
the distribution of sample uncertainty
- Figure 6: the sample distribution at each Speed. The red line is the Speed threshold; samples to its left exit immediately, and samples to its right move on to the next layer.
- (a): when samples pass through every layer, the higher (deeper) layers are more decisive; the Uncertainty of the samples converges toward 0 as they pass through the layers.
the convergence during self-distillation
- Figure 7: the self-distillation stage is an important step that makes it possible to offload work from the Teacher-Classifier.
  - Computation offloading is the transfer of resource intensive computational tasks to a separate processor - wikipedia
- During self-distillation, FLOPs drop markedly while accuracy barely decreases.

4.6 Ablation study

table3

Analysis on the Book Review and Yelp.P datasets
without self-distillation: all classifiers (teacher and students) are trained during the fine-tuning stage
without adaptive inference: inference is run only at the 2nd and 6th layers
The authors argue that self-distillation and adaptive inference both play an important role.

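Other things worth considering

The sample uncertainty distributions are shown for binary classification; how would they look for other tasks?
Does 3 epochs of fine-tuning the Backbone + Teacher-Classifier normally reach that level of accuracy?
Authors' GitHub Issue #7: when there are very many classification labels, most samples move on to the next layer unless the distribution is sharply peaked, so if more speed is wanted, one option might be to use the uncertainty of only the top N.

I tried to download the code from GitHub and run the pretrained model, but the weights are hosted on a Chinese cloud service and I could not access them (451 error). The idea itself is fairly simple, though, so it should be quick to implement.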
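One possible reading of the top-N idea is to keep only the N largest probabilities, renormalize them, and compute the normalized entropy over those N values. This is a speculative sketch of that reading: it is not confirmed to be what the issue or the repository implements, and the function names and the toy 100-class distribution are assumptions.

```python
import math


def uncertainty(p_s):
    """Normalized entropy over all classes (Eq. 7)."""
    entropy_term = sum(p * math.log(p) for p in p_s if p > 0.0)
    return entropy_term / math.log(1.0 / len(p_s))


def top_n_uncertainty(p_s, n):
    """Normalized entropy over the n largest probabilities, renormalized to sum to 1."""
    if not 2 <= n <= len(p_s):
        raise ValueError("n must be between 2 and the number of classes")
    top = sorted(p_s, reverse=True)[:n]
    total = sum(top)
    return uncertainty([p / total for p in top])


# Toy 100-class output: one dominant class plus a long, flat tail.
dist = [0.6] + [0.4 / 99] * 99
full = uncertainty(dist)
top5 = top_n_uncertainty(dist, 5)
print(f"full: {full:.3f}, top-5: {top5:.3f}")
```

For this toy distribution the long tail inflates the full uncertainty, while the top-5 value is much lower, so the sample would exit earlier at the same Speed. Whether that holds in practice depends on the shape of the distribution.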

Are there any posts or issues worth reading alongside this?

If so, feel free to write them here!

Please share the reference URLs! 🔗

Do not shorten them with markdown; just write the original links as they are!

FastBERT paper: https://arxiv.org/abs/2004.02178
FastBERT Github: https://github.com/autoliuweijie/FastBERT
Pingpong team - Jeong Ukjae's FastBERT review: https://jeongukjae.github.io/posts/fastbert-a-self-distilling-bert-with-adaptive-inference-time/

modulabs / beyondBERT