microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

Same benchmark, same architecture, but the WER is different, why? #24

Closed · splinter21 closed this 1 year ago

splinter21 commented 1 year ago

[Fig. 1: WER results table from SpeechLM] [Fig. 2: WER results table from SpeechUT]

Fig. 1 is from SpeechLM (https://arxiv.org/pdf/2209.15329.pdf) and Fig. 2 is from SpeechUT (https://arxiv.org/pdf/2210.03730.pdf).

I notice that the WER is the same in the Base-size group but different in the Large-size group.

Why?

zz12375 commented 1 year ago

@splinter21 Please take a close look at the experimental sections of these two papers 😂. In the first table you posted, the WERs of the Large models come from fine-tuning on 960h of data, while in the second table the Large models are fine-tuned on 100h of data. You can find Table 9 in Appendix E of the SpeechUT paper, which matches the results of Table 1 in the SpeechLM paper.

Hope the above information helps you.
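For readers comparing such tables: WER (word error rate) is the word-level edit distance between the hypothesis and reference transcripts divided by the reference length, so numbers are only comparable when the fine-tuning setup matches. Below is a minimal sketch of the metric with made-up transcripts; the `wer` function is illustrative, not code from either paper or this repo.

```python
# Minimal WER sketch: WER = (substitutions + deletions + insertions) / len(reference),
# computed as a word-level Levenshtein distance. Example transcripts are hypothetical.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```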

splinter21 commented 1 year ago

> @splinter21 Please take a close look at the experimental sections of these two papers 😂. In the first table you posted, the WERs of the Large models come from fine-tuning on 960h of data, while in the second table the Large models are fine-tuned on 100h of data. You can find Table 9 in Appendix E of the SpeechUT paper, which matches the results of Table 1 in the SpeechLM paper.
>
> Hope the above information helps you.

Thanks! You dispelled my doubts. Microsoft has published a lot of SSL papers in recent months, and after comparing them I think SpeechUT may be the best on the ASR task (so far).