# NeuralSP: Neural network based Speech Processing
## How to install

```bash
cd tools
make KALDI=/path/to/kaldi TOOL=/path/to/save/tools
```
## Key features
### Corpus
- ASR
  - AISHELL-1
  - AISHELL-2
  - AMI
  - CSJ
  - LaboroTVSpeech
  - Librispeech
  - Switchboard (+Fisher)
  - TEDLIUM2/TEDLIUM3
  - TIMIT
  - WSJ
- LM
  - Penn Tree Bank
  - WikiText2
### Front-end
- Frame stacking
- Sequence summary network [link]
- SpecAugment [link] (see the sketch after this list)
- Adaptive SpecAugment [link]
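
As a reference point, here is a minimal sketch of SpecAugment-style masking on a single utterance, assuming log-mel features stored as a `(time, freq)` tensor. The function name and mask widths are illustrative, not NeuralSP's actual API:

```python
import torch

def spec_augment(x: torch.Tensor,
                 num_freq_masks: int = 2, freq_width: int = 27,
                 num_time_masks: int = 2, time_width: int = 100) -> torch.Tensor:
    """Zero out random frequency and time bands of a (time, freq) feature tensor."""
    T, F = x.shape
    for _ in range(num_freq_masks):
        f = torch.randint(0, freq_width + 1, (1,)).item()   # mask height
        f0 = torch.randint(0, max(1, F - f), (1,)).item()   # mask start bin
        x[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = torch.randint(0, time_width + 1, (1,)).item()   # mask length
        t0 = torch.randint(0, max(1, T - t), (1,)).item()   # mask start frame
        x[t0:t0 + t, :] = 0.0
    return x

feats = torch.randn(500, 80)   # 500 frames of 80-dim log-mel features
feats = spec_augment(feats)
```

Adaptive SpecAugment additionally scales the time-mask budget with the utterance length instead of using fixed widths.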
### Encoder
- RNN encoder (see the sketch after this list)
  - (CNN-)BLSTM, (CNN-)LSTM, (CNN-)BLGRU, (CNN-)LGRU
  - Latency-controlled BRNN [link]
  - Random state passing (RSP) [link]
- Transformer encoder [link]
  - Chunk hopping mechanism [link]
  - Relative positional encoding [link]
  - Causal mask
- Conformer encoder [link]
- Time-depth separable (TDS) convolution encoder [link] [link]
- Gated CNN encoder (GLU) [link]
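
As referenced above, CNN-prefixed RNN encoders typically prepend strided convolutions for temporal subsampling before the recurrent stack. A minimal shape-level sketch of a CNN-BLSTM encoder, assuming 80-dim log-mel input and 4x time reduction; layer sizes are illustrative, not NeuralSP's defaults:

```python
import torch

class CNNBLSTMEncoder(torch.nn.Module):
    """Two strided conv blocks (4x time reduction) followed by a BLSTM stack."""

    def __init__(self, n_mels: int = 80, hidden: int = 512, layers: int = 5):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 32, 3, stride=2, padding=1), torch.nn.ReLU())
        freq_out = (n_mels + 3) // 4                      # freq bins after two stride-2 convs
        self.blstm = torch.nn.LSTM(32 * freq_out, hidden, num_layers=layers,
                                   bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_mels) -> add a channel axis for Conv2d
        h = self.conv(x.unsqueeze(1))                     # (batch, 32, time/4, n_mels/4)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)    # flatten channels into features
        out, _ = self.blstm(h)                            # (batch, time/4, 2*hidden)
        return out

enc = CNNBLSTMEncoder()
print(enc(torch.randn(4, 100, 80)).shape)                 # torch.Size([4, 25, 1024])
```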
### Connectionist Temporal Classification (CTC) decoder
- Beam search
- Shallow fusion
- Forced alignment
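
A toy illustration of the CTC pieces using PyTorch's built-in loss, with greedy (best-path) decoding; the shapes and the blank index are assumptions for the sketch, and NeuralSP's beam search with shallow fusion is more involved than this:

```python
import torch

# Toy setup: 50 encoder frames, a 29-token vocabulary with index 0 as the CTC blank.
log_probs = torch.randn(50, 1, 29).log_softmax(-1)     # (time, batch, vocab)
targets = torch.randint(1, 29, (1, 12))                # one 12-token reference
loss = torch.nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.tensor([50]), target_lengths=torch.tensor([12]))

# Greedy decoding: argmax per frame, collapse repeats, drop blanks.
best = log_probs.argmax(-1).squeeze(1)                 # (time,)
collapsed = torch.unique_consecutive(best)
hypothesis = collapsed[collapsed != 0]
```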
### RNN-Transducer (RNN-T) decoder [link]
- Beam search
- Shallow fusion
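
Shallow fusion, listed here and for the other decoders, interpolates the ASR score with an external LM score at decoding time. A one-line sketch of the per-token scoring rule; the weight value and function name are illustrative, not NeuralSP's configuration keys:

```python
def fused_score(asr_logp: float, lm_logp: float, lm_weight: float = 0.3) -> float:
    """Log-linear interpolation used to rank beam hypotheses under shallow fusion."""
    return asr_logp + lm_weight * lm_logp
```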
### Attention-based decoder
- RNN decoder
  - Shallow fusion
  - Cold fusion [link]
  - Deep fusion [link]
  - Forward-backward attention decoding [link]
  - Ensemble decoding
  - Internal LM estimation [link]
- Attention type (see the sketch after this list)
  - location-based
  - content-based
  - dot-product
  - GMM attention
- Streaming RNN decoder specific
  - Hard monotonic attention [link]
  - Monotonic chunkwise attention (MoChA) [link]
  - Delay constrained training (DeCoT) [link]
  - Minimum latency training (MinLT) [link]
  - CTC-synchronous training (CTC-ST) [link]
- Transformer decoder [link]
- Streaming Transformer decoder specific
  - Monotonic Multihead Attention [link] [link]
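
As referenced above, here is a shape-level sketch of the content-based (additive) and dot-product scoring functions over a toy encoder output; all weight tensors are random stand-ins for trained parameters:

```python
import torch

T, d = 100, 256
enc = torch.randn(T, d)                  # encoder outputs, one vector per frame
dec = torch.randn(d)                     # current decoder state

# Dot-product attention: score_t = <enc_t, dec> / sqrt(d)
score_dot = enc @ dec / d ** 0.5

# Content-based (additive) attention: score_t = v^T tanh(W_e enc_t + W_d dec)
W_e, W_d, v = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
score_add = torch.tanh(enc @ W_e.T + dec @ W_d.T) @ v

# Attention weights and context vector (the decoder input at this step)
context = (score_dot.softmax(0).unsqueeze(1) * enc).sum(0)
```

Location-based attention additionally conditions the score on the previous step's attention weights via a convolution, which favors monotonic movement along the frames.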
### Language model (LM)
- RNNLM (recurrent neural network language model)
- Gated convolutional LM [link]
- Transformer LM
- Transformer-XL LM [link]
- Adaptive softmax [link]
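
For the adaptive softmax item, PyTorch ships `torch.nn.AdaptiveLogSoftmaxWithLoss`, which clusters rare words into smaller projections to cut softmax cost on large vocabularies. A minimal RNNLM-style sketch; the vocabulary size and cutoffs are illustrative:

```python
import torch

vocab, hidden = 100000, 512
lstm = torch.nn.LSTM(input_size=hidden, hidden_size=hidden, batch_first=True)
asm = torch.nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden, n_classes=vocab, cutoffs=[1000, 10000])

emb = torch.randn(8, 20, hidden)               # (batch, seq, hidden) embeddings
out, _ = lstm(emb)
target = torch.randint(0, vocab, (8, 20))      # next-token ids
_, loss = asm(out.reshape(-1, hidden), target.reshape(-1))
```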
### Output units
- Phoneme
- Grapheme
- Wordpiece (BPE, sentencepiece)
- Word
- Word-char mix
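
For the wordpiece unit, the `sentencepiece` library handles both subword training and tokenization. A minimal BPE round trip; the file names are placeholders, not paths from this repository:

```python
import sentencepiece as spm

# Train a small BPE model on a raw-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input='train.txt', model_prefix='bpe1k',
    vocab_size=1000, model_type='bpe')

sp = spm.SentencePieceProcessor(model_file='bpe1k.model')
pieces = sp.encode('speech processing', out_type=str)   # subword strings
ids = sp.encode('speech processing')                    # integer ids for training
text = sp.decode(ids)                                   # lossless round trip
```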
### Multi-task learning (MTL)
Multi-task learning (MTL) with different units is supported to alleviate data sparseness.
- Hybrid CTC/attention [link] (see the sketch after this list)
- Hierarchical Attention (e.g., word attention + character attention) [link]
- Hierarchical CTC (e.g., word CTC + character CTC) [link]
- Hierarchical CTC+Attention (e.g., word attention + character CTC) [link]
- Forward-backward attention [link]
- LM objective
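
As referenced above, the hybrid CTC/attention objective is a weighted interpolation of the two losses; the weight value below is illustrative, not NeuralSP's default:

```python
import torch

def hybrid_loss(loss_att: torch.Tensor, loss_ctc: torch.Tensor,
                ctc_weight: float = 0.3) -> torch.Tensor:
    """L = lambda * L_ctc + (1 - lambda) * L_att, with lambda = ctc_weight."""
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```

The hierarchical variants apply the same interpolation idea across output units, e.g. a word-level attention loss combined with a character-level CTC loss at an intermediate layer.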
## ASR Performance
### AISHELL-1 (CER)

| Model | dev | test |
| --- | --- | --- |
| Conformer LAS | 4.1 | 4.5 |
| Transformer | 5.0 | 5.4 |
| Streaming MMA | 5.5 | 6.1 |
### AISHELL-2 (CER)

| Model | test_android | test_ios | test_mic |
| --- | --- | --- | --- |
| Conformer LAS | 6.1 | 5.5 | 5.9 |
### CSJ (WER)

| Model | eval1 | eval2 | eval3 |
| --- | --- | --- | --- |
| Conformer LAS | 5.7 | 4.4 | 4.9 |
| BLSTM LAS | 6.5 | 5.1 | 5.6 |
| LC-BLSTM MoChA | 7.4 | 5.6 | 6.4 |
### Switchboard 300h (WER)

| Model | SWB | CH |
| --- | --- | --- |
| BLSTM LAS | 9.1 | 18.8 |
### Switchboard+Fisher 2000h (WER)

| Model | SWB | CH |
| --- | --- | --- |
| BLSTM LAS | 7.8 | 13.8 |
### LaboroTVSpeech (CER)

| Model | dev_4k | dev | tedx-jp-10k |
| --- | --- | --- | --- |
| Conformer LAS | 7.8 | 10.1 | 12.4 |
### Librispeech (WER)

| Model | dev-clean | dev-other | test-clean | test-other |
| --- | --- | --- | --- | --- |
| Conformer LAS | 1.9 | 4.6 | 2.1 | 4.9 |
| Transformer | 2.1 | 5.3 | 2.4 | 5.7 |
| BLSTM LAS | 2.5 | 7.2 | 2.6 | 7.5 |
| BLSTM RNN-T | 2.9 | 8.5 | 3.2 | 9.0 |
| UniLSTM RNN-T | 3.7 | 11.7 | 4.0 | 11.6 |
| UniLSTM MoChA | 4.1 | 11.0 | 4.2 | 11.2 |
| LC-BLSTM RNN-T | 3.3 | 9.8 | 3.5 | 10.2 |
| LC-BLSTM MoChA | 3.3 | 8.8 | 3.5 | 9.1 |
| Streaming MMA | 2.5 | 6.9 | 2.7 | 7.1 |
### TEDLIUM2 (WER)

| Model | dev | test |
| --- | --- | --- |
| Conformer LAS | 7.0 | 6.8 |
| BLSTM LAS | 8.1 | 7.5 |
| LC-BLSTM RNN-T | 8.0 | 7.7 |
| LC-BLSTM MoChA | 10.3 | 8.6 |
| UniLSTM RNN-T | 10.7 | 10.7 |
| UniLSTM MoChA | 13.5 | 11.6 |
### WSJ (WER)

| Model | test_dev93 | test_eval92 |
| --- | --- | --- |
| BLSTM LAS | 8.8 | 6.2 |
## LM Performance
### Penn Tree Bank (PPL)

| Model | valid | test |
| --- | --- | --- |
| RNNLM | 87.99 | 86.06 |
| + cache=100 | 79.58 | 79.12 |
| + cache=500 | 77.36 | 76.94 |
### WikiText2 (PPL)

| Model | valid | test |
| --- | --- | --- |
| RNNLM | 104.53 | 98.73 |
| + cache=100 | 90.86 | 85.87 |
| + cache=2000 | 76.10 | 72.77 |
## Reference

## Dependency