k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Introduction

The icefall project contains speech-related recipes for various datasets using k2-fsa and lhotse.

You can use sherpa, sherpa-ncnn, or sherpa-onnx to deploy models trained with icefall. These frameworks also support models not included in icefall; please refer to their respective documentation for details.

You can try pre-trained models directly in your browser, without downloading or installing anything, by visiting this huggingface space. Please refer to the documentation for more details.

Installation

Please refer to the documentation for installation instructions.

Recipes

Please refer to the documentation for more details.

ASR: Automatic Speech Recognition

Supported Datasets

More datasets will be added in the future.

Supported Models

The LibriSpeech recipe supports the most comprehensive set of models; you are welcome to try them out.

- CTC
- MMI
- Transducer
- Whisper

If you would like to contribute to icefall, please refer to the contributing guide for more details.

We would like to highlight the performance of some of the recipes here.

yesno

This is the simplest ASR recipe in icefall and can be run on CPU. Training takes less than 30 seconds and gives you the following WER:

[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
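
The bracketed numbers decode as 1 error among 240 reference words: 0 insertions, 1 deletion, 0 substitutions. As a quick sanity check, WER is simply (insertions + deletions + substitutions) divided by the number of reference words:

```python
# Word error rate from error counts, as reported in the yesno line
# "[1 / 240, 0 ins, 1 del, 0 sub]".

def wer(ins, dels, subs, ref_words):
    """Return WER as a percentage."""
    return 100.0 * (ins + dels + subs) / ref_words

print(f"{wer(0, 1, 0, 240):.2f}%")  # -> 0.42%
```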

We provide a Colab notebook for this recipe: Open In Colab

LibriSpeech

Please see RESULTS.md for the latest results.

Conformer CTC

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 2.42       | 5.73       |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TDNN LSTM CTC

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 6.59       | 17.69      |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (Conformer Encoder + LSTM Predictor)

|               | test-clean | test-other |
|---------------|------------|------------|
| greedy_search | 3.07       | 7.51       |

We provide a Colab notebook to test the pre-trained model: Open In Colab
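
The greedy_search numbers above come from the simplest decoding method: at each frame, take the most likely token and emit it unless it is the blank symbol. The sketch below is heavily simplified and purely illustrative (icefall's real transducer greedy search also feeds each emitted token back through the predictor network; the scores here are hypothetical, not from any model):

```python
# Toy illustration of greedy_search decoding over precomputed per-frame
# scores. Token id 0 stands for the blank symbol; blanks emit nothing.

BLANK = 0

def greedy_search(frame_scores):
    """frame_scores: one list of per-token scores for each acoustic frame."""
    hyp = []
    for scores in frame_scores:
        # Pick the highest-scoring token at this frame.
        token = max(range(len(scores)), key=scores.__getitem__)
        if token != BLANK:
            hyp.append(token)
    return hyp

scores = [
    [0.1, 0.7, 0.2],  # frame 0: token 1 wins
    [0.8, 0.1, 0.1],  # frame 1: blank wins, nothing emitted
    [0.2, 0.1, 0.7],  # frame 2: token 2 wins
]
print(greedy_search(scores))  # -> [1, 2]
```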

Transducer (Conformer Encoder + Stateless Predictor)

|                                    | test-clean | test-other |
|------------------------------------|------------|------------|
| modified_beam_search (beam_size=4) | 2.56       | 6.27       |

We provide a Colab notebook to test the pre-trained model: Open In Colab
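
Unlike greedy search, modified_beam_search keeps the beam_size best partial hypotheses at each frame rather than committing to a single one, which is why it usually scores a little better in the tables here. A toy sketch, with the same simplifications hedged above (no predictor state, hypotheses that collapse to the same token sequence merged by max rather than log-add, hypothetical scores):

```python
# Toy illustration of beam-search decoding with beam_size hypotheses.
import math

BLANK = 0  # token id 0 stands for the blank symbol

def modified_beam_search(frame_logprobs, beam_size=4):
    """frame_logprobs: one list of per-token log-probs for each frame."""
    beams = {(): 0.0}  # emitted-token tuple -> accumulated log-prob
    for logprobs in frame_logprobs:
        candidates = {}
        for tokens, lp in beams.items():
            for token, token_lp in enumerate(logprobs):
                # Blank extends the hypothesis without emitting a token.
                new_tokens = tokens if token == BLANK else tokens + (token,)
                score = lp + token_lp
                if score > candidates.get(new_tokens, -math.inf):
                    candidates[new_tokens] = score
        # Prune to the beam_size highest-scoring hypotheses.
        beams = dict(sorted(candidates.items(),
                            key=lambda kv: -kv[1])[:beam_size])
    best = max(beams.items(), key=lambda kv: kv[1])[0]
    return list(best)

logprobs = [[-0.1, -2.0, -3.0], [-1.0, -0.5, -3.0]]
print(modified_beam_search(logprobs, beam_size=4))  # -> [1]
```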

Transducer (Zipformer Encoder + Stateless Predictor)

WER (modified_beam_search, beam_size=4, unless stated otherwise)

1. LibriSpeech-960hr

| Encoder         | Params | test-clean | test-other | epochs | devices      |
|-----------------|--------|------------|------------|--------|--------------|
| Zipformer       | 65.5M  | 2.21       | 4.79       | 50     | 4 x 32G-V100 |
| Zipformer-small | 23.2M  | 2.42       | 5.73       | 50     | 2 x 32G-V100 |
| Zipformer-large | 148.4M | 2.06       | 4.63       | 50     | 4 x 32G-V100 |
| Zipformer-large | 148.4M | 2.00       | 4.38       | 174    | 8 x 80G-A100 |

2. LibriSpeech-960hr + GigaSpeech

| Encoder   | Params | test-clean | test-other |
|-----------|--------|------------|------------|
| Zipformer | 65.5M  | 1.78       | 4.08       |

3. LibriSpeech-960hr + GigaSpeech + CommonVoice

| Encoder   | Params | test-clean | test-other |
|-----------|--------|------------|------------|
| Zipformer | 65.5M  | 1.90       | 3.98       |

GigaSpeech

Conformer CTC

|     | Dev   | Test  |
|-----|-------|-------|
| WER | 10.47 | 10.58 |

Transducer (pruned_transducer_stateless2)

Conformer Encoder + Stateless Predictor + k2 Pruned RNN-T Loss

|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy_search        | 10.51 | 10.73 |
| fast_beam_search     | 10.50 | 10.69 |
| modified_beam_search | 10.40 | 10.51 |

Transducer (Zipformer Encoder + Stateless Predictor)

|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy_search        | 10.31 | 10.50 |
| fast_beam_search     | 10.26 | 10.48 |
| modified_beam_search | 10.25 | 10.38 |

Aishell

TDNN LSTM CTC

|     | test  |
|-----|-------|
| CER | 10.16 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (Conformer Encoder + Stateless Predictor)

|     | test |
|-----|------|
| CER | 4.38 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (Zipformer Encoder + Stateless Predictor)

CER (modified_beam_search, beam_size=4)

| Encoder         | Params | dev  | test | epochs |
|-----------------|--------|------|------|--------|
| Zipformer       | 73.4M  | 4.13 | 4.40 | 55     |
| Zipformer-small | 30.2M  | 4.40 | 4.67 | 55     |
| Zipformer-large | 157.3M | 4.03 | 4.28 | 56     |

Aishell4

Transducer (pruned_transducer_stateless5)

Trained with all subsets:

|     | test  |
|-----|-------|
| CER | 29.08 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TIMIT

TDNN LSTM CTC

|     | TEST   |
|-----|--------|
| PER | 19.71% |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TDNN LiGRU CTC

|     | TEST   |
|-----|--------|
| PER | 17.66% |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TED-LIUM3

Transducer (Conformer Encoder + Stateless Predictor)

|                                    | dev  | test |
|------------------------------------|------|------|
| modified_beam_search (beam_size=4) | 6.91 | 6.33 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer (pruned_transducer_stateless)

|                                    | dev  | test |
|------------------------------------|------|------|
| modified_beam_search (beam_size=4) | 6.77 | 6.14 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Aidatatang_200zh

Transducer (pruned_transducer_stateless2)

|                      | Dev  | Test |
|----------------------|------|------|
| greedy_search        | 5.53 | 6.59 |
| fast_beam_search     | 5.30 | 6.34 |
| modified_beam_search | 5.27 | 6.33 |

We provide a Colab notebook to test the pre-trained model: Open In Colab

WenetSpeech

Transducer (pruned_transducer_stateless2)

|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy_search        | 7.80 | 8.75     | 13.49        |
| fast_beam_search     | 7.94 | 8.74     | 13.80        |
| modified_beam_search | 7.76 | 8.71     | 13.41        |

We provide a Colab notebook to test the pre-trained model: Open In Colab

Transducer Streaming (pruned_transducer_stateless5)

|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy_search        | 8.78 | 10.12    | 16.16        |
| fast_beam_search     | 9.01 | 10.47    | 16.28        |
| modified_beam_search | 8.53 | 9.95     | 15.81        |

Alimeeting

Transducer (pruned_transducer_stateless2)

|                      | Eval  | Test-Net |
|----------------------|-------|----------|
| greedy_search        | 31.77 | 34.66    |
| fast_beam_search     | 31.39 | 33.02    |
| modified_beam_search | 30.38 | 34.25    |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TAL_CSASR

Transducer (pruned_transducer_stateless5)

The best results for Chinese CER (%) and English WER (%), respectively (zh: Chinese, en: English):

| decoding-method      | dev  | dev_zh | dev_en | test | test_zh | test_en |
|----------------------|------|--------|--------|------|---------|---------|
| greedy_search        | 7.30 | 6.48   | 19.19  | 7.39 | 6.66    | 19.13   |
| fast_beam_search     | 7.18 | 6.39   | 18.90  | 7.27 | 6.55    | 18.77   |
| modified_beam_search | 7.15 | 6.35   | 18.95  | 7.22 | 6.50    | 18.70   |

We provide a Colab notebook to test the pre-trained model: Open In Colab

TTS: Text-to-Speech

Supported Datasets

Supported Models

Deployment with C++

Once you have trained a model in icefall, you may want to deploy it with C++ without Python dependencies.

Please refer to the documentation for how to do this.

We also provide a Colab notebook showing how to run a torch-scripted model in k2 with C++: Open In Colab