k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Introduction

The icefall project contains speech-related recipes for various datasets using k2-fsa and lhotse.

You can use sherpa, sherpa-ncnn, or sherpa-onnx to deploy models trained with icefall. These frameworks also support models not included in icefall; please refer to their respective documentation for details.

You can try pre-trained models directly in your browser, without downloading or installing anything, by visiting this huggingface space. Please refer to the documentation for more details.

Installation

Please refer to the documentation for installation instructions.

Recipes

Please refer to the documentation for more details.

ASR: Automatic Speech Recognition

Supported Datasets

More datasets will be added in the future.

Supported Models

The LibriSpeech recipe supports the most comprehensive set of models; you are welcome to try them out.

- CTC
- MMI
- Transducer
- Whisper

If you would like to contribute to icefall, please refer to the contributing guide for more details.

We would like to highlight the performance of some of the recipes here.

yesno

This is the simplest ASR recipe in icefall and can be run on CPU. Training takes less than 30 seconds and gives you the following WER:

[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
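
The bracketed numbers decode as 1 error among 240 reference words: 0 insertions, 1 deletion, 0 substitutions. As a quick sanity check, WER is simply (insertions + deletions + substitutions) divided by the number of reference words:

```python
# Word error rate from error counts, as reported in the yesno line
# "[1 / 240, 0 ins, 1 del, 0 sub]".

def wer(ins, dels, subs, ref_words):
    """Return WER as a percentage."""
    return 100.0 * (ins + dels + subs) / ref_words

print(f"{wer(0, 1, 0, 240):.2f}%")  # -> 0.42%
```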

We provide a Colab notebook for this recipe: Open In Colab

LibriSpeech

Please see RESULTS.md for the latest results.

Conformer CTC

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 2.42       | 5.73       |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TDNN LSTM CTC

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 6.59       | 17.69      |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (Conformer Encoder + LSTM Predictor)

|               | test-clean | test-other |
|---------------|------------|------------|
| greedy_search | 3.07       | 7.51       |

We provide a Colab notebook to test the pre-trained model: Open In Colab
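
The greedy_search numbers above come from the simplest decoding method: at each frame, take the most likely token and emit it unless it is the blank symbol. The sketch below is heavily simplified and purely illustrative (icefall's real transducer greedy search also feeds each emitted token back through the predictor network; the scores here are hypothetical, not from any model):

```python
# Toy illustration of greedy_search decoding over precomputed per-frame
# scores. Token id 0 stands for the blank symbol; blanks emit nothing.

BLANK = 0

def greedy_search(frame_scores):
    """frame_scores: one list of per-token scores for each acoustic frame."""
    hyp = []
    for scores in frame_scores:
        # Pick the highest-scoring token at this frame.
        token = max(range(len(scores)), key=scores.__getitem__)
        if token != BLANK:
            hyp.append(token)
    return hyp

scores = [
    [0.1, 0.7, 0.2],  # frame 0: token 1 wins
    [0.8, 0.1, 0.1],  # frame 1: blank wins, nothing emitted
    [0.2, 0.1, 0.7],  # frame 2: token 2 wins
]
print(greedy_search(scores))  # -> [1, 2]
```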

Transducer (Conformer Encoder + Stateless Predictor)

|                                    | test-clean | test-other |
|------------------------------------|------------|------------|
| modified_beam_search (beam_size=4) | 2.56       | 6.27       |

We provide a Colab notebook to test the pre-trained model: Open In Colab
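
Unlike greedy search, modified_beam_search keeps the beam_size best partial hypotheses at each frame rather than committing to a single one, which is why it usually scores a little better in the tables here. A toy sketch, with the same simplifications hedged above (no predictor state, hypotheses that collapse to the same token sequence merged by max rather than log-add, hypothetical scores):

```python
# Toy illustration of beam-search decoding with beam_size hypotheses.
import math

BLANK = 0  # token id 0 stands for the blank symbol

def modified_beam_search(frame_logprobs, beam_size=4):
    """frame_logprobs: one list of per-token log-probs for each frame."""
    beams = {(): 0.0}  # emitted-token tuple -> accumulated log-prob
    for logprobs in frame_logprobs:
        candidates = {}
        for tokens, lp in beams.items():
            for token, token_lp in enumerate(logprobs):
                # Blank extends the hypothesis without emitting a token.
                new_tokens = tokens if token == BLANK else tokens + (token,)
                score = lp + token_lp
                if score > candidates.get(new_tokens, -math.inf):
                    candidates[new_tokens] = score
        # Prune to the beam_size highest-scoring hypotheses.
        beams = dict(sorted(candidates.items(),
                            key=lambda kv: -kv[1])[:beam_size])
    best = max(beams.items(), key=lambda kv: kv[1])[0]
    return list(best)

logprobs = [[-0.1, -2.0, -3.0], [-1.0, -0.5, -3.0]]
print(modified_beam_search(logprobs, beam_size=4))  # -> [1]
```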

Transducer (Zipformer Encoder + Stateless Predictor)

WER (modified_beam_search, beam_size=4, unless stated otherwise)

1. LibriSpeech-960hr

| Encoder         | Params | test-clean | test-other | epochs | devices      |
|-----------------|--------|------------|------------|--------|--------------|
| Zipformer       | 65.5M  | 2.21       | 4.79       | 50     | 4 x 32G-V100 |
| Zipformer-small | 23.2M  | 2.42       | 5.73       | 50     | 2 x 32G-V100 |
| Zipformer-large | 148.4M | 2.06       | 4.63       | 50     | 4 x 32G-V100 |
| Zipformer-large | 148.4M | 2.00       | 4.38       | 174    | 8 x 80G-A100 |

2. LibriSpeech-960hr + GigaSpeech

| Encoder   | Params | test-clean | test-other |
|-----------|--------|------------|------------|
| Zipformer | 65.5M  | 1.78       | 4.08       |

3. LibriSpeech-960hr + GigaSpeech + CommonVoice

| Encoder   | Params | test-clean | test-other |
|-----------|--------|------------|------------|
| Zipformer | 65.5M  | 1.90       | 3.98       |

GigaSpeech

Conformer CTC

|     | Dev   | Test  |
|-----|-------|-------|
| WER | 10.47 | 10.58 |

Transducer (pruned_transducer_stateless2)

Conformer Encoder + Stateless Predictor + k2 Pruned RNN-T Loss

|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy_search        | 10.51 | 10.73 |
| fast_beam_search     | 10.50 | 10.69 |
| modified_beam_search | 10.40 | 10.51 |

Transducer (Zipformer Encoder + Stateless Predictor)

|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy_search        | 10.31 | 10.50 |
| fast_beam_search     | 10.26 | 10.48 |
| modified_beam_search | 10.25 | 10.38 |

Aishell

TDNN LSTM CTC

|     | test  |
|-----|-------|
| CER | 10.16 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (Conformer Encoder + Stateless Predictor)

|     | test |
|-----|------|
| CER | 4.38 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (Zipformer Encoder + Stateless Predictor)

CER (modified_beam_search, beam_size=4)

| Encoder         | Params | dev  | test | epochs |
|-----------------|--------|------|------|--------|
| Zipformer       | 73.4M  | 4.13 | 4.40 | 55     |
| Zipformer-small | 30.2M  | 4.40 | 4.67 | 55     |
| Zipformer-large | 157.3M | 4.03 | 4.28 | 56     |

Aishell4

Transducer (pruned_transducer_stateless5)

Trained with all subsets:

|     | test  |
|-----|-------|
| CER | 29.08 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TIMIT

TDNN LSTM CTC

|     | TEST   |
|-----|--------|
| PER | 19.71% |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TDNN LiGRU CTC

|     | TEST   |
|-----|--------|
| PER | 17.66% |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TED-LIUM3

Transducer (Conformer Encoder + Stateless Predictor)

|                                    | dev  | test |
|------------------------------------|------|------|
| modified_beam_search (beam_size=4) | 6.91 | 6.33 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (pruned_transducer_stateless)

|                                    | dev  | test |
|------------------------------------|------|------|
| modified_beam_search (beam_size=4) | 6.77 | 6.14 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Aidatatang_200zh

Transducer (pruned_transducer_stateless2)

|                      | Dev  | Test |
|----------------------|------|------|
| greedy_search        | 5.53 | 6.59 |
| fast_beam_search     | 5.30 | 6.34 |
| modified_beam_search | 5.27 | 6.33 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

WenetSpeech

Transducer (pruned_transducer_stateless2)

|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy_search        | 7.80 | 8.75     | 13.49        |
| fast_beam_search     | 7.94 | 8.74     | 13.80        |
| modified_beam_search | 7.76 | 8.71     | 13.41        |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer Streaming (pruned_transducer_stateless5)

|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy_search        | 8.78 | 10.12    | 16.16        |
| fast_beam_search     | 9.01 | 10.47    | 16.28        |
| modified_beam_search | 8.53 | 9.95     | 15.81        |

Alimeeting

Transducer (pruned_transducer_stateless2)

|                      | Eval  | Test-Net |
|----------------------|-------|----------|
| greedy_search        | 31.77 | 34.66    |
| fast_beam_search     | 31.39 | 33.02    |
| modified_beam_search | 30.38 | 34.25    |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TAL_CSASR

Transducer (pruned_transducer_stateless5)

The best results for Chinese CER (%) and English WER (%), respectively (zh: Chinese, en: English):

| decoding-method      | dev  | dev_zh | dev_en | test | test_zh | test_en |
|----------------------|------|--------|--------|------|---------|---------|
| greedy_search        | 7.30 | 6.48   | 19.19  | 7.39 | 6.66    | 19.13   |
| fast_beam_search     | 7.18 | 6.39   | 18.90  | 7.27 | 6.55    | 18.77   |
| modified_beam_search | 7.15 | 6.35   | 18.95  | 7.22 | 6.50    | 18.70   |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TTS: Text-to-Speech

Supported Datasets

Supported Models

Deployment with C++

Once you have trained a model in icefall, you may want to deploy it with C++ without Python dependencies.

Please refer to the documentation for how to do this.

We also provide a Colab notebook showing how to run a torch-scripted model in k2 with C++: Open In Colab