The speed-up techniques give a 4.4x decoding speed-up over the baseline decoder
Proposes an (RNN + FC) decoder that achieves near-SOTA BLEU at ~100 words/sec decoding speed on a single-threaded CPU
Details
In this work, we consider a production scenario which requires low-latency, high-throughput NMT decoding. We focus on CPU-based decoders, since GPU/FPGA/ASIC-based decoders require specialized hardware deployment and impose logistical constraints such as batch processing.
Baseline Decoder
written in C++
uses a sampled softmax over ~500 vocabulary candidates
Intel MKL library for matmul
early stopping for beam search: stop when the top partial hypothesis has a log-score 3.0 worse than the best completed hypothesis (see the sketch after this list)
batch matmul applied whenever possible
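A minimal sketch of the early-stopping rule above, assuming a simple per-hypothesis score/flag representation; the 3.0 margin is from the note, all names are illustrative:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Hypothesis {
    double log_score;   // cumulative log-probability
    bool   completed;   // hypothesis ended with </s>
};

// Returns true once the best partial hypothesis can no longer beat the
// best completed hypothesis by more than the fixed log-score margin.
bool should_stop(const std::vector<Hypothesis>& beam, double margin = 3.0) {
    double best_complete = -std::numeric_limits<double>::infinity();
    double best_partial  = -std::numeric_limits<double>::infinity();
    for (const Hypothesis& h : beam) {
        double& best = h.completed ? best_complete : best_partial;
        best = std::max(best, h.log_score);
    }
    // Stop when the top partial hypothesis trails the best completed
    // sentence by more than `margin` in log-space.
    return best_partial < best_complete - margin;
}
```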
Decoder Speed Improvements
16-bit MatMul
CPU libraries such as MKL usually only support 32-bit floating point, so the authors implemented 16-bit integer matmul, with offline preprocessing (quantization) of the weight matrices
faster than 32-bit float matmul (sketched below)
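A minimal sketch of the 16-bit path, assuming a simple symmetric per-matrix scale computed offline; the paper's actual fixed-point scheme and low-level kernel are not shown, and all names here are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantMatrix {
    std::vector<int16_t> w;  // row-major, rows x cols
    int rows, cols;
    float scale;             // w_int16 = round(w_float / scale)
};

// Offline preprocessing: quantize a float weight matrix to int16.
QuantMatrix quantize(const std::vector<float>& w, int rows, int cols) {
    float max_abs = 1e-8f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    QuantMatrix q{std::vector<int16_t>(w.size()), rows, cols,
                  max_abs / 32767.0f};
    for (size_t i = 0; i < w.size(); ++i)
        q.w[i] = static_cast<int16_t>(std::lround(w[i] / q.scale));
    return q;
}

// Inference-time y = W x: int16 products accumulated in int32, then
// rescaled to float once per output element. A real kernel would
// vectorize this loop and guard the accumulator against int32 overflow.
void matvec(const QuantMatrix& W, const std::vector<int16_t>& x,
            float x_scale, std::vector<float>& y) {
    for (int r = 0; r < W.rows; ++r) {
        int32_t acc = 0;
        for (int c = 0; c < W.cols; ++c)
            acc += static_cast<int32_t>(W.w[r * W.cols + c]) * x[c];
        y[r] = acc * W.scale * x_scale;
    }
}
```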
Pre-Compute Embeddings
the src and tgt embeddings of the k (= 8,000) most frequent words can be pre-computed and fetched from a lookup table during inference for faster decoding (sketched below)
the top ~10% of the vocabulary covers ~95% of tokens
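A minimal sketch of the lookup-table path, assuming word ids are sorted by frequency so ids below k are exactly the top-k words; the projection W * E[word] stands in for whatever per-word matmul the decoder performs, and all names are illustrative:

```cpp
#include <algorithm>
#include <vector>

struct EmbedProjector {
    int k = 8000, in_dim = 0, out_dim = 0;
    std::vector<float> W;           // out_dim x in_dim input projection
    std::vector<float> E;           // vocab_size x in_dim embeddings
    std::vector<float> proj_table;  // k x out_dim, filled offline

    // y = W * E[word]: the per-word work we want to avoid at runtime.
    void project(int word, float* y) const {
        const float* e = &E[static_cast<size_t>(word) * in_dim];
        for (int r = 0; r < out_dim; ++r) {
            float acc = 0.f;
            for (int c = 0; c < in_dim; ++c) acc += W[r * in_dim + c] * e[c];
            y[r] = acc;
        }
    }

    // Offline preprocessing: precompute projections for the top-k words.
    void build_table() {
        proj_table.resize(static_cast<size_t>(k) * out_dim);
        for (int w = 0; w < k; ++w)
            project(w, &proj_table[static_cast<size_t>(w) * out_dim]);
    }

    // Decode time: table hit for ~95% of tokens, matvec fallback otherwise.
    void lookup(int word, float* y) const {
        if (word < k) {
            const float* p = &proj_table[static_cast<size_t>(word) * out_dim];
            std::copy(p, p + out_dim, y);
        } else {
            project(word, y);
        }
    }
};
```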
Pre-Compute Attention
refactor the attention matmuls so that source-side terms are computed once per sentence rather than once per decoding step, reducing the number of multiplications per sentence (sketched below)
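A minimal sketch, assuming additive attention score(t_i, s_j) = v . tanh(W_t t_i + W_s s_j); the key observation is that W_s s_j is independent of the decoder step, so it can be hoisted out and computed once per sentence. The exact attention form in the paper may differ, and all names here are illustrative:

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // list of rows

Vec matvec(const Mat& M, const Vec& x) {
    Vec y(M.size(), 0.f);
    for (size_t r = 0; r < M.size(); ++r)
        for (size_t c = 0; c < x.size(); ++c) y[r] += M[r][c] * x[c];
    return y;
}

// Once per sentence: precompute W_s * s_j for every source position j.
Mat precompute_source_proj(const Mat& W_s, const Mat& src_states) {
    Mat proj;
    for (const Vec& s : src_states) proj.push_back(matvec(W_s, s));
    return proj;
}

// Once per decoding step: only one matvec (W_t * t_i) remains; the
// per-source-position work is just an add, a tanh, and a dot product.
Vec attention_scores(const Mat& W_t, const Vec& t_i,
                     const Mat& src_proj, const Vec& v) {
    Vec wt = matvec(W_t, t_i);
    Vec scores;
    for (const Vec& ws : src_proj) {
        float score = 0.f;
        for (size_t d = 0; d < v.size(); ++d)
            score += v[d] * std::tanh(wt[d] + ws[d]);
        scores.push_back(score);
    }
    return scores;
}
```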
SSE & Lookup Tables
for the element-wise vector functions in the GRU, we can use vectorized instructions (SSE/AVX) for add and multiply, and lookup tables for sigmoid and tanh (sketched below)
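A minimal sketch of both tricks: SSE for element-wise add (multiply is analogous with _mm_mul_ps), plus a clamped lookup table replacing tanh. The table size and input range are my own choices, not the paper's:

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <algorithm>
#include <cmath>

// c[i] = a[i] + b[i], four floats per instruction (n divisible by 4).
void add_sse(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}

// tanh via a precomputed table over [-8, 8]; outside this range tanh
// has already saturated to +/-1 within float precision.
struct TanhTable {
    static constexpr int   N     = 4096;
    static constexpr float RANGE = 8.0f;
    float table[N];

    TanhTable() {
        for (int i = 0; i < N; ++i)
            table[i] = std::tanh(-RANGE + 2.0f * RANGE * i / (N - 1));
    }
    float operator()(float x) const {
        int idx = static_cast<int>((x + RANGE) * (N - 1) / (2.0f * RANGE));
        return table[std::clamp(idx, 0, N - 1)];
    }
};
```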
Merge Recurrent States
in the GRU, h_(i-1) is equal across beam hypotheses whose last two emitted tokens are the same, so only the unique h_(i-1) vectors need to be computed during decoding (sketched below)
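A minimal sketch of the merging: the recurrent step is keyed by the last two emitted tokens and computed once per unique pair, then shared across hypotheses. The GRU update itself is stubbed out, and all names are illustrative:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Vec = std::vector<float>;

// Placeholder for the real gated recurrent update.
Vec gru_step(const Vec& h_prev, int /*word*/) { return h_prev; }

struct Hyp {
    int last_word = 0, second_last_word = 0;
    const Vec* h_prev = nullptr;  // shared, possibly merged, previous state
};

// Advance all hypotheses one step, computing at most one GRU update per
// unique (second-to-last, last) token pair. state_cache is assumed to be
// cleared before each decoding step; pointers into an unordered_map stay
// valid across inserts, so sharing &it->second is safe.
void advance_beam(std::vector<Hyp>& beam,
                  std::unordered_map<uint64_t, Vec>& state_cache) {
    for (Hyp& h : beam) {
        uint64_t key = (static_cast<uint64_t>(h.second_last_word) << 32) |
                       static_cast<uint32_t>(h.last_word);
        auto it = state_cache.find(key);
        if (it == state_cache.end())  // first hypothesis with this pair pays
            it = state_cache.emplace(key, gru_step(*h.h_prev, h.last_word)).first;
        h.h_prev = &it->second;       // the rest reuse the cached state
    }
}
```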
Speed-up Result
4.4x overall
the 16-bit matmul is the most significant single improvement
but this is measured on a simple 1-layer decoder model; the speed-up still needs to be shown on a model with production-level accuracy
Model Improvements
Instead of a fully recurrent model, which is computationally heavy, the authors propose an (RNN + multiple FC layers stacked with residual connections) model (sketched below)
Result : competitive BLEU with GNMT and fast decoding speed
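A minimal sketch of that decoder stack: one recurrent layer followed by several fully connected layers with residual (additive skip) connections. The square layer shape, the ReLU, and all names are illustrative; the paper's exact configuration may differ:

```cpp
#include <vector>

using Vec = std::vector<float>;

struct FCLayer {
    std::vector<Vec> W;  // square (out == in) so the residual add lines up
    Vec b;

    Vec forward(const Vec& x) const {
        Vec y(b);
        for (size_t r = 0; r < W.size(); ++r) {
            for (size_t c = 0; c < x.size(); ++c) y[r] += W[r][c] * x[c];
            if (y[r] < 0.f) y[r] = 0.f;                      // ReLU
        }
        for (size_t r = 0; r < y.size(); ++r) y[r] += x[r];  // residual
        return y;
    }
};

struct Decoder {
    // Single recurrent layer; stub standing in for the real GRU update.
    Vec gru_step(const Vec& h_prev, const Vec& /*input*/) const { return h_prev; }
    std::vector<FCLayer> fc_stack;

    // One decoding step: the FC stack carries the depth without adding
    // sequential recurrent dependencies.
    Vec step(const Vec& h_prev, const Vec& input) const {
        Vec h = gru_step(h_prev, input);
        for (const FCLayer& fc : fc_stack) h = fc.forward(h);
        return h;  // fed to the (sampled) output softmax
    }
};
```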
Personal Thoughts
How fast is this?
nTransformer enko.l2 : 108.53 words / sec
An ensemble of 2x Model (S4) at 102 words / sec on a single CPU is EXTREMELY fast!
Improvements we can do
pre-computing embeddings
early stopping for beam search
Improvements we cannot do
use 16-bit : it is difficult to implement a whole new integer matmul library
Not sure how we can implement
Pre-compute Attention
SSE & Lookup Table
Merge Recurrent States
A good reference for our internal nTransformer enhancement paper!!
Link : https://arxiv.org/pdf/1705.01991.pdf
Authors : Devlin et al., 2017