microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing

Reproduce ASR experiment results in Hugging Face #59

Closed jjyaoao closed 11 months ago

jjyaoao commented 11 months ago

I am trying to fine-tune SpeechT5-base on the train-clean-100 subset using the Transformers library. The problem I am encountering is that my result on test-other is suspiciously good (WER = 2.76), which makes me suspect there may be a problem with the model or the method I am using, while my test-clean result is similar to the paper.
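For reference, here is a minimal sketch of the kind of WER evaluation I mean, assuming the `librispeech_asr` dataset on the Hub and the `transformers` / `datasets` / `evaluate` APIs; the `microsoft/speecht5_asr` checkpoint name is only a stand-in for my fine-tuned model, and this is not my exact script:

```python
# Minimal sketch: compute WER for a SpeechT5 ASR checkpoint on a
# LibriSpeech split with the Transformers / Datasets / Evaluate libraries.
import torch
import evaluate
from datasets import load_dataset
from transformers import SpeechT5ForSpeechToText, SpeechT5Processor

checkpoint = "microsoft/speecht5_asr"  # placeholder for the fine-tuned model
processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForSpeechToText.from_pretrained(checkpoint)
model.eval()

# Stream test-other so the whole corpus is not downloaded up front.
test_other = load_dataset("librispeech_asr", "other", split="test", streaming=True)

wer_metric = evaluate.load("wer")
predictions, references = [], []

for sample in test_other.take(100):  # small slice for a quick sanity check
    inputs = processor(
        audio=sample["audio"]["array"],
        sampling_rate=sample["audio"]["sampling_rate"],
        return_tensors="pt",
    )
    with torch.no_grad():
        predicted_ids = model.generate(**inputs, max_length=450)
    predictions.append(
        processor.batch_decode(predicted_ids, skip_special_tokens=True)[0].lower()
    )
    references.append(sample["text"].lower())

print("WER:", wer_metric.compute(predictions=predictions, references=references))
```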

Here is the model I fine-tuned. Could you help me check whether there is a serious problem with the experimental setup? Thanks ♪(・ω・)ノ