Closed splinter21 closed 1 year ago
@splinter21 Please take a close look at the experimental sections of these two papers 😂. In the first table you posted, the WERs of the Large models come from fine-tuning on 960h of data, while in the second table the Large models are fine-tuned on only 100h. You can find Table 9 in Appendix E of the SpeechUT paper, which matches the results of Table 1 in the SpeechLM paper.
Hope the above information helps you.
Thanks! You dispelled my doubts. Microsoft has published lots of SSL papers in recent months, and after comparing them I think SpeechUT may be the best on the ASR task (so far).
Fig 1 is from SpeechLM (https://arxiv.org/pdf/2209.15329.pdf) and Fig 2 is from SpeechUT (https://arxiv.org/pdf/2210.03730.pdf).
I notice that the WER is the same in the Base Size group but different in the Large Size group.
Why?