microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

Same benchmark, same architecture, but the WER is different, why? #24

Closed · splinter21 closed this 1 year ago

splinter21 commented 1 year ago

[Fig. 1: WER results table from SpeechLM] [Fig. 2: WER results table from SpeechUT]

Fig. 1 is from SpeechLM (https://arxiv.org/pdf/2209.15329.pdf) and Fig. 2 is from SpeechUT (https://arxiv.org/pdf/2210.03730.pdf).

I notice that the WER is the same in the Base-size group but different in the Large-size group.

Why?

zz12375 commented 1 year ago

@splinter21 Please take a close look at the experimental sections of these two papers 😂. In the first table you posted, the WERs of the Large models come from fine-tuning on 960h of data, while in the second table the Large models are fine-tuned on 100h of data. You can find Table 9 in Appendix E of the SpeechUT paper, which matches the results of Table 1 in the SpeechLM paper.

Hope the above information helps you.
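For readers comparing such tables: WER (word error rate) is the word-level edit distance between the hypothesis and reference transcripts divided by the reference length, so numbers are only comparable when the fine-tuning setup matches. Below is a minimal sketch of the metric with made-up transcripts; the `wer` function is illustrative, not code from either paper or this repo.

```python
# Minimal WER sketch: WER = (substitutions + deletions + insertions) / len(reference),
# computed as a word-level Levenshtein distance. Example transcripts are hypothetical.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```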

splinter21 commented 1 year ago

> @splinter21 Please take a close look at the experimental sections of these two papers 😂. In the first table you posted, the WERs of the Large models come from fine-tuning on 960h of data, while in the second table the Large models are fine-tuned on 100h of data. You can find Table 9 in Appendix E of the SpeechUT paper, which matches the results of Table 1 in the SpeechLM paper.
>
> Hope the above information helps you.

Thanks! You dispelled my doubts. Microsoft has published a lot of SSL papers in recent months, and after comparing them I think SpeechUT may be the best on the ASR task (so far).