modulabs / beyondBERT

This repository collects the discussion notes from the beyondBERT sessions of cohort 11.5.

PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination #21

Closed seopbo closed 4 years ago

seopbo commented 4 years ago

What is this paper about? 👋

By eliminating relatively less important word-vectors, PoWER-BERT speeds up inference by up to 4.5x on the GLUE tasks while keeping the accuracy loss under 1% compared to BERT.

Abstract (Summary) 🕵🏻‍♂️

We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. It works by: a) exploiting redundancy pertaining to word-vectors (intermediate encoder outputs) and eliminating the redundant vectors. b) determining which word-vectors to eliminate by developing a strategy for measuring their significance, based on the self-attention mechanism. c) learning how many word-vectors to eliminate by augmenting the BERT model and the loss function. Experiments on the standard GLUE benchmark show that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT with < 1% loss in accuracy. We show that PoWER-BERT offers significantly better trade-off between accuracy and inference time compared to prior methods. We demonstrate that our method attains up to 6.8x reduction in inference time with < 1% loss in accuracy when applied over ALBERT, a highly compressed version of BERT. The code for PoWER-BERT is publicly available at https://github.com/IBM/PoWER-BERT.
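Below is a minimal sketch of points (a) and (b) above: each word-vector is scored by the total self-attention it receives (summed over heads and query positions), and only the top-k most significant vectors are kept, with [CLS] always retained. The tensor shapes, function names, and the hard top-k extraction are illustrative assumptions for inference time, not the authors' exact implementation (see the official repo for that).

```python
import torch

def significance_scores(attn_probs: torch.Tensor) -> torch.Tensor:
    """Score each word-vector by the total attention it receives,
    summed over heads and query positions.
    attn_probs: (batch, num_heads, seq_len, seq_len), each row sums to 1."""
    return attn_probs.sum(dim=1).sum(dim=1)  # -> (batch, seq_len)

def eliminate_word_vectors(hidden: torch.Tensor,
                           attn_probs: torch.Tensor,
                           keep: int) -> torch.Tensor:
    """Keep only the `keep` most significant word-vectors before the next
    encoder layer; position 0 ([CLS]) is never eliminated."""
    scores = significance_scores(attn_probs)          # (batch, seq_len)
    scores[:, 0] = float("inf")                       # always retain [CLS]
    idx = scores.topk(keep, dim=1).indices
    idx = idx.sort(dim=1).values                      # preserve original word order
    batch_idx = torch.arange(hidden.size(0)).unsqueeze(1)
    return hidden[batch_idx, idx]                     # (batch, keep, hidden_dim)
```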

What can we learn from reading this paper? 🤔

1. Introduction

Motivation

Previous Work

Our Objective and Approach

2. Background

3. PoWER-BERT Scheme

Motivation

Diffusion of Information

PoWER-BERT Components

Word-vector Selection

Retention Configuration

Loss Function

Training PoWER-BERT

Training proceeds in three steps:
1) Fine-tuning: fine-tune the pre-trained BERT model on the downstream task.
2) Configuration search: add soft-extract layers to the fine-tuned model and modify the loss function. To find a good trade-off between inference time and accuracy, lambda and the retention parameters are learned; the learning rate here is set 10 to 100 times larger than usual.
3) Re-training: replace the soft-extract layers with extract layers. The number of word-vectors to eliminate at each layer is fixed by the retention configuration found in step 2, and which word-vectors to eliminate is decided by the significance score ([CLS] is never eliminated).
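As a hedged sketch of what step 2 (configuration search) could look like: the soft-extract layer scales word-vectors by a soft mask driven by learnable retention parameters instead of dropping them, and the loss gains a lambda-weighted penalty on the total retained mass, so accuracy is traded against inference time by gradient descent. The exact parameterization (one retention value per significance rank, sigmoid mask) and the names SoftExtract / power_bert_loss are assumptions for illustration, not the paper's precise formulation; in step 3 the soft masks are replaced by the hard top-k extraction sketched earlier.

```python
import torch
import torch.nn as nn

class SoftExtract(nn.Module):
    """Illustrative soft-extract layer for the configuration-search step:
    word-vectors are scaled (not removed) by a soft mask driven by learnable
    retention parameters, so 'how many to keep' becomes differentiable."""
    def __init__(self, max_seq_len: int):
        super().__init__()
        # one retention parameter per significance rank (assumed parameterization)
        self.retention = nn.Parameter(torch.ones(max_seq_len))

    def forward(self, hidden, attn_probs):
        # significance = total attention received, summed over heads and queries
        scores = attn_probs.sum(dim=1).sum(dim=1)            # (batch, seq_len)
        ranks = scores.argsort(dim=1, descending=True).argsort(dim=1)
        mask = torch.sigmoid(self.retention)[ranks]          # keep-probability per position
        return hidden * mask.unsqueeze(-1)                   # scale instead of eliminate

def power_bert_loss(task_loss, soft_extract_layers, lam):
    # regularizer: lambda times the total retained "mass" across layers,
    # which pushes the model toward keeping fewer word-vectors
    mass = sum(torch.sigmoid(layer.retention).sum() for layer in soft_extract_layers)
    return task_loss + lam * mass
```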

4. Evaluation

(Table 2 from the paper)

5. Conclusion

Are there any related articles or issues worth reading alongside this paper?

Other approaches to model compression: https://blog.est.ai/2020/03/%EB%94%A5%EB%9F%AC%EB%8B%9D-%EB%AA%A8%EB%8D%B8-%EC%95%95%EC%B6%95-%EB%B0%A9%EB%B2%95%EB%A1%A0%EA%B3%BC-bert-%EC%95%95%EC%B6%95/

Hyperparameter details (supplementary material): https://proceedings.icml.cc/static/paper_files/icml/2020/6722-Supplemental.pdf

Please share the reference URLs! 🔗

Paper: https://proceedings.icml.cc/static/paper_files/icml/2020/6722-Paper.pdf
GitHub: https://github.com/IBM/PoWER-BERT
BERT Architecture: https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/blocks/bert-encoder?fbclid=IwAR0YUclKFzwx-2MEqRb_X_yTePqFju2E_oLHbcmrnWTmTIxoYe5gnRgH6CE