Closed akuzeee closed 1 year ago
Hi,
Thank you for your interest in our project and sharing your experience with us!
In our experience with TEDxJP, we found that the dataset contains recordings whose speech starts abruptly (sometimes with the first syllable omitted). Unfortunately, this type of data is not very common in our dataset, and we think this could be one of the reasons why it is difficult for our Conformer-based ASR model. However, we found that for the TEDxJP dataset, simply leaving an extra span (~200 ms) at the beginning of each utterance's timestamps before cutting led to significant improvements in terms of CER.
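The padding trick above can be sketched as a small helper. This is a minimal, hypothetical illustration (the `pad_segment_starts` name and the `(start, end)`-in-seconds segment format are assumptions, not part of the project's code):

```python
def pad_segment_starts(segments, pad=0.2):
    """Shift each utterance's start earlier by `pad` seconds (clamped at 0)
    before cutting, so abrupt onsets keep their first syllable.

    `segments` is a list of (start, end) timestamps in seconds.
    """
    return [(max(0.0, start - pad), end) for start, end in segments]

# A segment starting at 0.05 s is clamped to 0.0 s; others move back 200 ms.
print(pad_segment_starts([(0.05, 1.4), (3.0, 5.2)]))
# [(0.0, 1.4), (2.8, 5.2)]
```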
Improving this facet of robustness is one of our future tasks as well, so any feedback on this issue is appreciated.
Thank you for your quick reply and helpful advice! In my previous experiments, I used a VAD model, which may have worsened the CER further by truncating the leading silence. I found that simply removing the VAD model improved the CER by about 4%. I will continue to investigate this issue and will let you know if there are any updates.
FYI

> simply leaving an extra span (~200 ms) at the beginning of the utterances' timestamps

By following this I have improved the CER to 18% :)
@akuzeee FYI
We conducted an experiment finetuning the released pretrained model on a small augmented dataset with randomly trimmed starting points. The CER on TEDxJP 10k is about 13.5%. :)
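The random start-trimming augmentation described above can be sketched as follows. This is a hedged illustration only: the `random_trim_start` helper, the 200 ms cap, and the raw-sample-list representation are assumptions, not the actual training code:

```python
import random

def random_trim_start(waveform, sample_rate=16000, max_trim=0.2, rng=random):
    """Drop a random span (up to `max_trim` seconds) from the start of a
    waveform, simulating utterances whose onset was cut off."""
    n = rng.randint(0, int(max_trim * sample_rate))
    return waveform[n:]

rng = random.Random(0)                 # seeded for reproducibility
audio = list(range(16000))             # stand-in for 1 s of 16 kHz samples
trimmed = random_trim_start(audio, rng=rng)
# At most 0.2 s (3200 samples) is removed from the front.
assert 0 <= len(audio) - len(trimmed) <= 3200
```

Applying such trims during finetuning exposes the model to the abrupt-onset condition it otherwise rarely sees.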
@akuzeee JFYI. Today we released a new version of the ReazonSpeech model. In combination with the new reazonspeech.transcribe() interface, we can confirm that it achieves 9.18% CER on the TEDxJP dataset (vs. 11.10% with Whisper Large-v2). Just let us know if there is anything you'd like to hear about regarding our research!
Hi, Thank you very much for publishing the large Japanese corpus.
Unfortunately, in my experiment, the CER on the TEDxJP corpus for ESPnet ReazonSpeech is not so good (around 30%, while ESPnet LaboroTVSpeech is around 13%). Qualitatively, for example, one can confirm that the recognition results for the following samples (from the TEDxJP corpus) are not good in the Hugging Face Hosted Inference API:
Of course, there might be some errors in my implementation and hyperparameter settings, so I would like to hear about your experience on the TEDxJP corpus and any tips regarding hyperparameters.
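For comparing CER figures like those in this thread, the standard definition (character-level edit distance divided by reference length) can be sketched as below. The `cer` helper is illustrative and not part of either repository; in practice a library such as jiwer is usually used:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between the two
    strings, normalized by the reference length."""
    h = list(hypothesis)
    # Dynamic-programming edit distance over characters.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(reference, 1):
        curr = [i]
        for j, hc in enumerate(h, 1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(reference)

print(cer("こんにちは", "こんにちわ"))  # 0.2  (1 substitution / 5 chars)
```

Differences in text normalization (punctuation, number formats) before computing CER can easily shift the reported figure, which is one place implementation discrepancies hide.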