Sharing the results of the experiments we ran in parallel during the NLP2023 conference.
Experiment summary
Item | Details |
---|---|
Training data | ReazonSpeech corpus (22,218 hours) |
Model | ESPnet Conformer model |
Parameter count | ~90M |
Data augmentation | MUSAN + cutoff, each applied to 37.5% of the data |
Main results
The figures below are preliminary results. Since this research is still in progress, the exact numbers will be published later.
This write-up outlines what we did to create a high-quality ESPnet Japanese ASR model, and how these efforts turned out.
Early on, we set out the following goals for ReazonSpeech v1.1:
To this end, we have taken three measures:
As discussed below, these measures were all effective in improving ASR performance, to varying degrees. Here is a more detailed discussion of each:
Gold-standard audio corpora consist of pairs of single utterances and their corresponding transcriptions.
One of the major lessons learned from ReazonSpeech v1.0 was that such training datasets do not generalize well to real-world audio, which often contains more than a single utterance, possibly by multiple speakers.
To overcome this limitation, we created a "multiple-utterance" corpus from Japanese TV shows by concatenating consecutive captions. The following figure illustrates a statistical property (audio duration distribution) of the corpus:
We trained Conformer-Transformer models with varying mixtures of single-utterance and multiple-utterance datasets, and benchmarked them against JSUT-BASIC5000 (short audio) and JSUT-Book (long audio):
Based on this result, we ended up using a 50:50 mix of single/multiple-utterance datasets for model training.
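As an illustration of the corpus construction step, here is a minimal sketch of how consecutive captions might be concatenated into multiple-utterance samples. The `Caption` structure, its field names, and the 30-second duration cap are assumptions for illustration, not the actual ReazonSpeech pipeline.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Caption:
    """A single captioned segment (field names are hypothetical)."""
    audio: np.ndarray   # waveform samples
    text: str           # transcription
    start: float        # start time in seconds
    end: float          # end time in seconds


def concat_consecutive(captions, max_duration=30.0):
    """Greedily merge consecutive captions into multiple-utterance samples.

    max_duration is an assumed cap on the merged segment length.
    """
    samples, buf = [], []
    for cap in captions:
        buf.append(cap)
        if buf[-1].end - buf[0].start >= max_duration:
            samples.append(_merge(buf))
            buf = []
    if buf:
        samples.append(_merge(buf))
    return samples


def _merge(buf):
    # Concatenate the audio and the caption text of consecutive segments.
    audio = np.concatenate([c.audio for c in buf])
    text = "".join(c.text for c in buf)
    return audio, text
```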
We experimented with two data augmentation techniques (MUSAN noise mixing and cutoff) in the hope of improving the robustness of our models:
These techniques did work. Not only did they lower the CER scores on noisy test sets (like TEDx), but also on clean test sets such as JSUT.
Here is a comparison of two models trained with the same recipe (5000-hour dataset), with and without data augmentation:
Generally speaking, we observed a ~1% reduction in CER on average when data augmentation was applied.
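For reference, here is a minimal sketch of how such an augmentation policy could look. The 37.5% application rate comes from the summary table above; the noise-mixing and cutoff functions below are simplified placeholders, not the exact recipe used in training.

```python
import random

import numpy as np


def add_musan_noise(wav, noise, snr_db=10.0):
    """Mix a MUSAN noise clip into the waveform at a target SNR (simplified)."""
    noise = np.resize(noise, wav.shape)          # tile/trim noise to match length
    wav_power = np.mean(wav ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(wav_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise


def cutoff(wav, max_ratio=0.1):
    """Zero out a random contiguous chunk of the waveform (simplified cutoff)."""
    n = len(wav)
    width = random.randint(1, max(1, int(n * max_ratio)))
    start = random.randint(0, n - width)
    out = wav.copy()
    out[start:start + width] = 0.0
    return out


def augment(wav, noise_bank, p=0.375):
    """Apply MUSAN mixing and cutoff, each to ~37.5% of training samples."""
    if random.random() < p:
        wav = add_musan_noise(wav, random.choice(noise_bank))
    if random.random() < p:
        wav = cutoff(wav)
    return wav
```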
To help Conformer-Transformer models process long audio data, we implemented a VAD-like function based on CTC network outputs.
The basic technique is the same as in arXiv:2002.00551, but we added a few tweaks:
Instead of introducing a threshold parameter (a.k.a. minimum blank duration), we decided to cut the audio data at the longest consecutive blanks found in a given window.
We pad each extracted audio segment using np.pad(). We found that adding 500ms-1000ms margins improves the recognition accuracy significantly.
Compared to the naive streaming method (i.e. splitting long audio into fixed-length segments), we observed that this technique lowers the CER on JSUT-Book by ~4%.
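Below is a minimal sketch of the segmentation idea described above: scan a window of CTC outputs, cut at the longest consecutive run of blank frames, and pad each extracted segment with np.pad(). The blank token index, frame shift, window size, and margin values are all assumptions for illustration, not the values used in our implementation.

```python
import numpy as np

BLANK_ID = 0          # CTC blank token index (assumed)
FRAME_SHIFT = 0.04    # seconds per CTC output frame (assumed)


def longest_blank_run_midpoint(ctc_ids, lo, hi):
    """Return the midpoint (frame index) of the longest run of consecutive
    blank frames in ctc_ids[lo:hi]; fall back to hi if no blanks are found."""
    best_len, best_mid = 0, hi
    run_start = None
    for i in range(lo, hi + 1):
        is_blank = i < hi and ctc_ids[i] == BLANK_ID
        if is_blank and run_start is None:
            run_start = i
        elif not is_blank and run_start is not None:
            if i - run_start > best_len:
                best_len, best_mid = i - run_start, (run_start + i) // 2
            run_start = None
    return best_mid


def split_on_blanks(wav, ctc_ids, sr=16000, window_sec=20.0, margin_sec=0.75):
    """Cut the waveform at the longest blank run inside each window and pad
    every extracted segment with ~750ms of silence on both sides."""
    frames_per_window = int(window_sec / FRAME_SHIFT)
    samples_per_frame = int(FRAME_SHIFT * sr)
    margin = int(margin_sec * sr)

    segments, start = [], 0
    total = len(ctc_ids)
    while start < total:
        end = min(start + frames_per_window, total)
        cut = end if end == total else longest_blank_run_midpoint(ctc_ids, start, end)
        cut = max(cut, start + 1)  # always make progress
        seg = wav[start * samples_per_frame:cut * samples_per_frame]
        segments.append(np.pad(seg, (margin, margin)))
        start = cut
    return segments
```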
In conclusion, here is a rough overview of the CER improvements by measure:
Measure | Improvement for Short Audio | Improvement for Long Audio |
---|---|---|
(1) Training with single/multiple-utterance datasets | N/A | >10% |
(2) Data augmentation using MUSAN + Cutoff | ~1% | ~1% |
(3) CTC-based chunk-wise audio processing | ~1% | ~4% |
@sw005320 JFYI, I wrote a short review of the enhancements we made to train a better ESPnet Japanese ASR model (which we released as ReazonSpeech v1.1 last week).
We sincerely thank you for all the suggestions you gave us! They were super helpful!
Very cool! Can you make a PR to https://github.com/espnet/espnet/tree/master/egs2/reazonspeech/asr1 about this new training scheme? We had discussed that the results of our discussions should be open-sourced.
> Can you make a PR to https://github.com/espnet/espnet/tree/master/egs2/reazonspeech/asr1 about this new training scheme?
@sw005320 We are currently working on it!
We're planning to upstream the improvements as a series of PRs, so please bear with us a bit.
Now we can close this ticket as resolved.
Release of the ReazonSpeech v1.1 model
Goal
Release reazonspeech-espnet-v1.1.
Improvements over v1
Notes