Hi there,
I have a question regarding the output of SpeechT5SpeechDecoderPostnet.
The pretrained SpeechT5Model from Hugging Face ('microsoft/speecht5_tts') returns an output of shape (B, 6274, 80), since the last layer it forwards through is the SpeechT5SpeechDecoderPostnet. I understand that we get 80 mel bins, and the paper, the code, and the Hugging Face docs all state that the result is a mel spectrogram. What confuses me is the 6274: that is the time dimension, no? Yet when I run 2 s of 16 kHz audio through the pretrained SpeechT5Processor, I get a mel spectrogram of shape (B, 126, 80).

I would very much appreciate it if someone could tell me what is going on here.
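For reference, here is a minimal sketch of how I get the (B, 126, 80) spectrogram. The 440 Hz sine and the placeholder text are stand-ins for my real inputs, and I'm assuming the `labels` key that the processor returns for `audio_target`, as in the Hugging Face fine-tuning examples:

```python
import numpy as np
from transformers import SpeechT5Processor

# Load the same checkpoint I am using.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# 2 seconds of 16 kHz audio; a 440 Hz sine stands in for my real clip.
sampling_rate = 16000
t = np.linspace(0.0, 2.0, 2 * sampling_rate, endpoint=False)
waveform = 0.1 * np.sin(2.0 * np.pi * 440.0 * t).astype(np.float32)

# Passing the waveform as audio_target yields the log-mel spectrogram labels.
inputs = processor(
    text="hello world",        # placeholder text input
    audio_target=waveform,
    sampling_rate=sampling_rate,
    return_tensors="pt",
)

print(inputs["labels"].shape)  # (1, 126, 80) on my end for 2 s of audio
```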
Sincerely, Khalil