ASR SpeechT5 training - model predicts same output for different inputs

Hi! I am trying to train a SpeechT5ForSpeechToText model for an ASR task from scratch. Training goes quite well most of the time, but when I use the model for inference with model.generate(**input), it predicts the same output for different inputs. I'm using the Hugging Face implementation and followed every step on how to train the model, but I can't find the error in my code that makes the model predict the same output for every input. Might this be a general problem with the SpeechT5ForSpeechToText implementation on Hugging Face, or am I doing something wrong? Any quick help would be really appreciated!

microsoft/SpeechT5, issue #62