The speed-up techniques give a 4.4x decoding speed-up over the baseline decoder
Proposes an (RNN + FC) decoder that achieves near-SOTA BLEU at ~100 words/sec decoding speed on a single-threaded CPU
Details
In this work, we consider a production scenario which requires low-latency, high-throughput NMT decoding. We focus on CPU-based decoders, since GPU/FPGA/ASIC-based decoders require specialized hardware deployment and impose logistical constraints such as batch processing.
Baseline Decoder
written in C++
uses a sampled softmax over ~500 vocabulary candidates
Intel MKL library for matmul
early stopping for beam search: stop when the top partial hypothesis has a log-score 3.0 worse than the best completed hypothesis (see the sketch after this list)
batch matmul applied whenever possible
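A minimal sketch of the early-stopping rule above, assuming a simple per-hypothesis score/flag representation; the 3.0 margin is from the note, all names are illustrative:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Hypothesis {
    double log_score;   // cumulative log-probability
    bool   completed;   // hypothesis ended with </s>
};

// Returns true once the best partial hypothesis can no longer beat the
// best completed hypothesis by more than the fixed log-score margin.
bool should_stop(const std::vector<Hypothesis>& beam, double margin = 3.0) {
    double best_complete = -std::numeric_limits<double>::infinity();
    double best_partial  = -std::numeric_limits<double>::infinity();
    for (const Hypothesis& h : beam) {
        double& best = h.completed ? best_complete : best_partial;
        best = std::max(best, h.log_score);
    }
    // Stop when the top partial hypothesis trails the best completed
    // sentence by more than `margin` in log-space.
    return best_partial < best_complete - margin;
}
```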
Decoder Speed Improvements
16-bit MatMul
CPU libraries such as MKL usually only support 32-bit floating point, so the authors implemented 16-bit integer matmul, with offline preprocessing (quantization) of the weight matrices
faster than 32-bit float matmul (sketched below)
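A minimal sketch of the 16-bit path, assuming a simple symmetric per-matrix scale computed offline; the paper's actual fixed-point scheme and low-level kernel are not shown, and all names here are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantMatrix {
    std::vector<int16_t> w;  // row-major, rows x cols
    int rows, cols;
    float scale;             // w_int16 = round(w_float / scale)
};

// Offline preprocessing: quantize a float weight matrix to int16.
QuantMatrix quantize(const std::vector<float>& w, int rows, int cols) {
    float max_abs = 1e-8f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    QuantMatrix q{std::vector<int16_t>(w.size()), rows, cols,
                  max_abs / 32767.0f};
    for (size_t i = 0; i < w.size(); ++i)
        q.w[i] = static_cast<int16_t>(std::lround(w[i] / q.scale));
    return q;
}

// Inference-time y = W x: int16 products accumulated in int32, then
// rescaled to float once per output element. A real kernel would
// vectorize this loop and guard the accumulator against int32 overflow.
void matvec(const QuantMatrix& W, const std::vector<int16_t>& x,
            float x_scale, std::vector<float>& y) {
    for (int r = 0; r < W.rows; ++r) {
        int32_t acc = 0;
        for (int c = 0; c < W.cols; ++c)
            acc += static_cast<int32_t>(W.w[r * W.cols + c]) * x[c];
        y[r] = acc * W.scale * x_scale;
    }
}
```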
Pre-Compute Embeddings
the src and tgt embeddings of the k (= 8,000) most frequent words can be pre-computed and fetched from a lookup table during inference for faster decoding (sketched below)
the top ~10% of the vocabulary covers ~95% of tokens
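A minimal sketch of the lookup-table path, assuming word ids are sorted by frequency so ids below k are exactly the top-k words; the projection W * E[word] stands in for whatever per-word matmul the decoder performs, and all names are illustrative:

```cpp
#include <algorithm>
#include <vector>

struct EmbedProjector {
    int k = 8000, in_dim = 0, out_dim = 0;
    std::vector<float> W;           // out_dim x in_dim input projection
    std::vector<float> E;           // vocab_size x in_dim embeddings
    std::vector<float> proj_table;  // k x out_dim, filled offline

    // y = W * E[word]: the per-word work we want to avoid at runtime.
    void project(int word, float* y) const {
        const float* e = &E[static_cast<size_t>(word) * in_dim];
        for (int r = 0; r < out_dim; ++r) {
            float acc = 0.f;
            for (int c = 0; c < in_dim; ++c) acc += W[r * in_dim + c] * e[c];
            y[r] = acc;
        }
    }

    // Offline preprocessing: precompute projections for the top-k words.
    void build_table() {
        proj_table.resize(static_cast<size_t>(k) * out_dim);
        for (int w = 0; w < k; ++w)
            project(w, &proj_table[static_cast<size_t>(w) * out_dim]);
    }

    // Decode time: table hit for ~95% of tokens, matvec fallback otherwise.
    void lookup(int word, float* y) const {
        if (word < k) {
            const float* p = &proj_table[static_cast<size_t>(word) * out_dim];
            std::copy(p, p + out_dim, y);
        } else {
            project(word, y);
        }
    }
};
```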
Pre-Compute Attention
refactor the attention matmuls so that source-side terms are computed once per sentence rather than once per decoding step, reducing the number of multiplications per sentence (sketched below)
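A minimal sketch, assuming additive attention score(t_i, s_j) = v . tanh(W_t t_i + W_s s_j); the key observation is that W_s s_j is independent of the decoder step, so it can be hoisted out and computed once per sentence. The exact attention form in the paper may differ, and all names here are illustrative:

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // list of rows

Vec matvec(const Mat& M, const Vec& x) {
    Vec y(M.size(), 0.f);
    for (size_t r = 0; r < M.size(); ++r)
        for (size_t c = 0; c < x.size(); ++c) y[r] += M[r][c] * x[c];
    return y;
}

// Once per sentence: precompute W_s * s_j for every source position j.
Mat precompute_source_proj(const Mat& W_s, const Mat& src_states) {
    Mat proj;
    for (const Vec& s : src_states) proj.push_back(matvec(W_s, s));
    return proj;
}

// Once per decoding step: only one matvec (W_t * t_i) remains; the
// per-source-position work is just an add, a tanh, and a dot product.
Vec attention_scores(const Mat& W_t, const Vec& t_i,
                     const Mat& src_proj, const Vec& v) {
    Vec wt = matvec(W_t, t_i);
    Vec scores;
    for (const Vec& ws : src_proj) {
        float score = 0.f;
        for (size_t d = 0; d < v.size(); ++d)
            score += v[d] * std::tanh(wt[d] + ws[d]);
        scores.push_back(score);
    }
    return scores;
}
```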
SSE & Lookup Tables
for the element-wise vector functions in the GRU, we can use vectorized instructions (SSE/AVX) for add and multiply, and lookup tables for sigmoid and tanh (sketched below)
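A minimal sketch of both tricks: SSE for element-wise add (multiply is analogous with _mm_mul_ps), plus a clamped lookup table replacing tanh. The table size and input range are my own choices, not the paper's:

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <algorithm>
#include <cmath>

// c[i] = a[i] + b[i], four floats per instruction (n divisible by 4).
void add_sse(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}

// tanh via a precomputed table over [-8, 8]; outside this range tanh
// has already saturated to +/-1 within float precision.
struct TanhTable {
    static constexpr int   N     = 4096;
    static constexpr float RANGE = 8.0f;
    float table[N];

    TanhTable() {
        for (int i = 0; i < N; ++i)
            table[i] = std::tanh(-RANGE + 2.0f * RANGE * i / (N - 1));
    }
    float operator()(float x) const {
        int idx = static_cast<int>((x + RANGE) * (N - 1) / (2.0f * RANGE));
        return table[std::clamp(idx, 0, N - 1)];
    }
};
```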
Merge Recurrent States
in the GRU, h_(i-1) is equal across beam hypotheses whose last two emitted tokens are the same, so only the unique h_(i-1) vectors need to be computed during decoding (sketched below)
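A minimal sketch of the merging: the recurrent step is keyed by the last two emitted tokens and computed once per unique pair, then shared across hypotheses. The GRU update itself is stubbed out, and all names are illustrative:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Vec = std::vector<float>;

// Placeholder for the real gated recurrent update.
Vec gru_step(const Vec& h_prev, int /*word*/) { return h_prev; }

struct Hyp {
    int last_word = 0, second_last_word = 0;
    const Vec* h_prev = nullptr;  // shared, possibly merged, previous state
};

// Advance all hypotheses one step, computing at most one GRU update per
// unique (second-to-last, last) token pair. state_cache is assumed to be
// cleared before each decoding step; pointers into an unordered_map stay
// valid across inserts, so sharing &it->second is safe.
void advance_beam(std::vector<Hyp>& beam,
                  std::unordered_map<uint64_t, Vec>& state_cache) {
    for (Hyp& h : beam) {
        uint64_t key = (static_cast<uint64_t>(h.second_last_word) << 32) |
                       static_cast<uint32_t>(h.last_word);
        auto it = state_cache.find(key);
        if (it == state_cache.end())  // first hypothesis with this pair pays
            it = state_cache.emplace(key, gru_step(*h.h_prev, h.last_word)).first;
        h.h_prev = &it->second;       // the rest reuse the cached state
    }
}
```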
Speed-up Result
4.4x overall
the 16-bit matmul is the most significant single improvement
but this is measured on a simple 1-layer decoder model; the speed-up still needs to be shown on a model with production-level accuracy
Model Improvements
Instead of a fully recurrent model, which is computationally heavy, the authors propose an (RNN + multiple FC layers stacked with residual connections) model (sketched below)
Result : competitive BLEU with GNMT and fast decoding speed
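A minimal sketch of that decoder stack: one recurrent layer followed by several fully connected layers with residual (additive skip) connections. The square layer shape, the ReLU, and all names are illustrative; the paper's exact configuration may differ:

```cpp
#include <vector>

using Vec = std::vector<float>;

struct FCLayer {
    std::vector<Vec> W;  // square (out == in) so the residual add lines up
    Vec b;

    Vec forward(const Vec& x) const {
        Vec y(b);
        for (size_t r = 0; r < W.size(); ++r) {
            for (size_t c = 0; c < x.size(); ++c) y[r] += W[r][c] * x[c];
            if (y[r] < 0.f) y[r] = 0.f;                      // ReLU
        }
        for (size_t r = 0; r < y.size(); ++r) y[r] += x[r];  // residual
        return y;
    }
};

struct Decoder {
    // Single recurrent layer; stub standing in for the real GRU update.
    Vec gru_step(const Vec& h_prev, const Vec& /*input*/) const { return h_prev; }
    std::vector<FCLayer> fc_stack;

    // One decoding step: the FC stack carries the depth without adding
    // sequential recurrent dependencies.
    Vec step(const Vec& h_prev, const Vec& input) const {
        Vec h = gru_step(h_prev, input);
        for (const FCLayer& fc : fc_stack) h = fc.forward(h);
        return h;  // fed to the (sampled) output softmax
    }
};
```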
Personal Thoughts
How fast is this?
nTransformer enko.l2 : 108.53 words / sec
An ensemble of 2x Model (S4) at 102 words / sec on a single CPU is EXTREMELY fast!
Improvements we can do
pre-computing embeddings
early stopping for beam search
Improvements we cannot do
use 16-bit : it is difficult to implement a whole new integer matmul library
Not sure how we can implement
Pre-compute Attention
SSE & Lookup Table
Merge Recurrent States
A good reference for our internal nTransformer enhancement paper!!
Link : https://arxiv.org/pdf/1705.01991.pdf
Authors : Devlin et al., 2017