Sharing the results of the experiments we ran in parallel during the NLP2023 conference.
Experiment summary
Item | Details |
---|---|
Training data | ReazonSpeech corpus (22,218 hours) |
Model | ESPnet Conformer model |
Parameter count | ~90M |
Data augmentation | MUSAN + cutoff, each applied to 37.5% of the data |
Main results
The figures below are preliminary results. Since this research is still in progress, the exact numbers will be published later.
This write-up outlines what we did to create a high-quality ESPnet Japanese ASR model, and how these efforts turned out.
Early on, we set out the following goals for ReazonSpeech v1.1:
To this end, we have taken three measures:
As discussed below, these measures were all effective in improving ASR performance, to varying degrees. Here is a more detailed discussion of each:
Gold-standard audio corpora consist of pairs of single utterances and their corresponding transcriptions.
One of the major lessons learned from ReazonSpeech v1.0 was that such training datasets do not generalize well to real-world audio, which often contains more than a single utterance, possibly by multiple speakers.
To overcome this limitation, we created a "multiple-utterance" corpus from Japanese TV shows by concatenating consecutive captions. The following figure illustrates a statistical property (audio duration distribution) of the corpus:
We trained Conformer-Transformer models with varying mixtures of single-utterance and multiple-utterance datasets, and benchmarked them against JSUT-BASIC5000 (short audio) and JSUT-Book (long audio):
Based on this result, we ended up using a 50:50 mix of single/multiple-utterance datasets for model training.
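As an illustration of the corpus construction step, here is a minimal sketch of how consecutive captions might be concatenated into multiple-utterance samples. The `Caption` structure, its field names, and the 30-second duration cap are assumptions for illustration, not the actual ReazonSpeech pipeline.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Caption:
    """A single captioned segment (field names are hypothetical)."""
    audio: np.ndarray   # waveform samples
    text: str           # transcription
    start: float        # start time in seconds
    end: float          # end time in seconds


def concat_consecutive(captions, max_duration=30.0):
    """Greedily merge consecutive captions into multiple-utterance samples.

    max_duration is an assumed cap on the merged segment length.
    """
    samples, buf = [], []
    for cap in captions:
        buf.append(cap)
        if buf[-1].end - buf[0].start >= max_duration:
            samples.append(_merge(buf))
            buf = []
    if buf:
        samples.append(_merge(buf))
    return samples


def _merge(buf):
    # Concatenate the audio and the caption text of consecutive segments.
    audio = np.concatenate([c.audio for c in buf])
    text = "".join(c.text for c in buf)
    return audio, text
```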
We experimented with two data augmentation techniques (MUSAN noise mixing and cutoff) in the hope of improving the robustness of our models:
These techniques did work. Not only did they lower the CER scores on noisy test sets (like TEDx), but also on clean test sets such as JSUT.
Here is a comparison of two models trained with the same recipe (5000-hour dataset), with and without data augmentation:
Generally speaking, we observed a ~1% reduction in CER on average when data augmentation was applied.
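For reference, here is a minimal sketch of how such an augmentation policy could look. The 37.5% application rate comes from the summary table above; the noise-mixing and cutoff functions below are simplified placeholders, not the exact recipe used in training.

```python
import random

import numpy as np


def add_musan_noise(wav, noise, snr_db=10.0):
    """Mix a MUSAN noise clip into the waveform at a target SNR (simplified)."""
    noise = np.resize(noise, wav.shape)          # tile/trim noise to match length
    wav_power = np.mean(wav ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(wav_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise


def cutoff(wav, max_ratio=0.1):
    """Zero out a random contiguous chunk of the waveform (simplified cutoff)."""
    n = len(wav)
    width = random.randint(1, max(1, int(n * max_ratio)))
    start = random.randint(0, n - width)
    out = wav.copy()
    out[start:start + width] = 0.0
    return out


def augment(wav, noise_bank, p=0.375):
    """Apply MUSAN mixing and cutoff, each to ~37.5% of training samples."""
    if random.random() < p:
        wav = add_musan_noise(wav, random.choice(noise_bank))
    if random.random() < p:
        wav = cutoff(wav)
    return wav
```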
To help Conformer-Transformer models process long audio data, we implemented a VAD-like function based on CTC network outputs.
The basic technique is the same as in arXiv:2002.00551, but we added a few tweaks:
Instead of introducing a threshold parameter (a.k.a. minimum blank duration), we decided to cut the audio data at the longest consecutive blanks found in a given window.
We pad each extracted audio segment using np.pad(). We found that adding 500ms-1000ms margins improves the recognition accuracy significantly.
Compared to the naive streaming method (i.e. splitting long audio into fixed-length segments), we observed that this technique lowers the CER on JSUT-Book by ~4%.
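Below is a minimal sketch of the segmentation idea described above: scan a window of CTC outputs, cut at the longest consecutive run of blank frames, and pad each extracted segment with np.pad(). The blank token index, frame shift, window size, and margin values are all assumptions for illustration, not the values used in our implementation.

```python
import numpy as np

BLANK_ID = 0          # CTC blank token index (assumed)
FRAME_SHIFT = 0.04    # seconds per CTC output frame (assumed)


def longest_blank_run_midpoint(ctc_ids, lo, hi):
    """Return the midpoint (frame index) of the longest run of consecutive
    blank frames in ctc_ids[lo:hi]; fall back to hi if no blanks are found."""
    best_len, best_mid = 0, hi
    run_start = None
    for i in range(lo, hi + 1):
        is_blank = i < hi and ctc_ids[i] == BLANK_ID
        if is_blank and run_start is None:
            run_start = i
        elif not is_blank and run_start is not None:
            if i - run_start > best_len:
                best_len, best_mid = i - run_start, (run_start + i) // 2
            run_start = None
    return best_mid


def split_on_blanks(wav, ctc_ids, sr=16000, window_sec=20.0, margin_sec=0.75):
    """Cut the waveform at the longest blank run inside each window and pad
    every extracted segment with ~750ms of silence on both sides."""
    frames_per_window = int(window_sec / FRAME_SHIFT)
    samples_per_frame = int(FRAME_SHIFT * sr)
    margin = int(margin_sec * sr)

    segments, start = [], 0
    total = len(ctc_ids)
    while start < total:
        end = min(start + frames_per_window, total)
        cut = end if end == total else longest_blank_run_midpoint(ctc_ids, start, end)
        cut = max(cut, start + 1)  # always make progress
        seg = wav[start * samples_per_frame:cut * samples_per_frame]
        segments.append(np.pad(seg, (margin, margin)))
        start = cut
    return segments
```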
In conclusion, here is a rough overview of the CER improvements by measure:
Measure | Improvement for Short Audio | Improvement for Long Audio |
---|---|---|
(1) Training with single/multiple-utterance datasets | N/A | >10% |
(2) Data augmentation using MUSAN + Cutoff | ~1% | ~1% |
(3) CTC-based chunk-wise audio processing | ~1% | ~4% |
@sw005320 JFYI, I wrote a short review of the enhancements we made to train a better ESPnet Japanese ASR model (which we released as ReazonSpeech v1.1 last week).
We sincerely thank you for all the suggestions you gave us! They were super helpful!
Very cool! Can you make a PR to https://github.com/espnet/espnet/tree/master/egs2/reazonspeech/asr1 about this new training scheme? We had discussed that the results of our discussions should be open-sourced.
> Can you make a PR to https://github.com/espnet/espnet/tree/master/egs2/reazonspeech/asr1 about this new training scheme?
@sw005320 We are currently working on it!
We're planning to upstream the improvements as a series of PRs, so please bear with us a bit.
Now we can close this ticket as resolved.
Release of the ReazonSpeech v1.1 model
Goal
Release reazonspeech-espnet-v1.1.
Improvements over v1
Notes