reazon-research / ReazonSpeech

Massive open Japanese speech corpus
https://research.reazon.jp/projects/ReazonSpeech/
Apache License 2.0

CER on TEDxJP #11

Closed akuzeee closed 1 year ago

akuzeee commented 1 year ago

Hi, thank you very much for publishing this large Japanese corpus.

Unfortunately, in my experiment, the CER on the TEDxJP corpus for ESPnet ReazonSpeech is not very good (around 30%, while ESPnet LaboroTVSpeech achieves around 13%).

Qualitatively, for example, one can confirm that the recognition results for the following samples (from the TEDxJP corpus) are not good in the Hosted Inference API on Hugging Face:

Of course, there might be some errors in my implementation or hyperparameter settings, so I would like to hear about any experience you have with the TEDxJP corpus and any tips regarding hyperparameters.

euyniy commented 1 year ago

Hi,

Thank you for your interest in our project and for sharing your experience with us!

In our experience with TEDxJP, we found that the dataset has recordings whose speech starts abruptly (sometimes with the first syllable omitted). Unfortunately, this type of data is not very common in our dataset, and we think this could be one of the reasons why it is difficult for our Conformer-based ASR model. However, we found that for the TEDxJP dataset, simply leaving an extra span (~200 ms) at the beginning of each utterance's timestamps before cutting leads to significant improvements in CER.
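The padding trick above can be sketched as follows. This is a minimal illustration, assuming utterances are given as (start, end) timestamps in seconds over a 16 kHz waveform; the function name and defaults are illustrative, not part of the ReazonSpeech codebase:

```python
import numpy as np

def cut_with_padding(waveform, start_sec, end_sec, rate=16000, pad_sec=0.2):
    """Cut an utterance from a waveform, leaving ~200 ms of extra audio
    before its start so abrupt onsets (e.g. a clipped first syllable)
    are preserved for the ASR model."""
    start = max(0, int((start_sec - pad_sec) * rate))
    end = min(len(waveform), int(end_sec * rate))
    return waveform[start:end]

# Example: a 3-second dummy signal with an utterance at 1.0-2.0 s
audio = np.zeros(3 * 16000, dtype=np.float32)
clip = cut_with_padding(audio, 1.0, 2.0)
print(len(clip) / 16000)  # 1.2: the utterance plus 200 ms of left context
```

The `max(0, ...)` clamp matters for utterances near the start of a recording, where there is no extra context to include.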

Improving this facet of robustness is one of our future tasks as well, so any feedback on this issue is appreciated.

akuzeee commented 1 year ago

Thank you for your quick reply and helpful advice! In my previous experiments, I used a VAD model, which might have worsened the CER further by truncating the silence intervals. I found that simply removing the VAD model improved the CER by about 4%. I will continue to investigate this issue and will let you know if there is any update.

akuzeee commented 1 year ago

FYI

simply leaving extra span(~200ms) at the beginning of the utterances' timestamps

By following this I have improved the CER to 18% :)

euyniy commented 1 year ago

@akuzeee FYI

We conducted an experiment finetuning the released pretrained model on a small augmented dataset with randomly trimmed starting points. The CER on TEDxJP 10k is about 13.5%. :)
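The augmentation described above can be sketched roughly as below. This is a hypothetical illustration of random start-trimming, assuming 16 kHz waveforms; the function name and the 200 ms bound are assumptions, not the released training code:

```python
import numpy as np

def random_start_trim(waveform, rate=16000, max_trim_sec=0.2, rng=None):
    """Augmentation sketch: randomly drop up to ~200 ms from the start
    of an utterance so the model sees abrupt speech onsets (including
    partially clipped first syllables) during finetuning."""
    rng = rng or np.random.default_rng()
    trim = int(rng.integers(0, int(max_trim_sec * rate) + 1))
    return waveform[trim:]

rng = np.random.default_rng(0)
audio = np.ones(16000, dtype=np.float32)  # 1 s of dummy audio
out = random_start_trim(audio, rng=rng)
assert 16000 - 3200 <= len(out) <= 16000  # at most 200 ms removed
```

Applied on the fly during finetuning, this exposes the model to the same abrupt-onset condition that hurts it on TEDxJP.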

fujimotos commented 1 year ago

@akuzeee JFYI. Today we released a new version of ReazonSpeech model.

In combination with the new reazonspeech.transcribe() interface, we can confirm that it achieves 9.18% CER on the TEDxJP dataset (vs 11.10% with Whisper Large-v2).

Just let us know if there is anything you'd like to hear about regarding our research!