microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

Confusion/Question about SpeechT5SpeechDecoderPostnet output #79

Open Student204161 opened 6 months ago

Student204161 commented 6 months ago

Hi there,

I have a question regarding the output of SpeechT5SpeechDecoderPostnet.

The pretrained SpeechT5Model from Hugging Face ('microsoft/speecht5_tts') returns an output of shape (B, 6274, 80); the last layer it forwards through is the SpeechT5SpeechDecoderPostnet. I understand that the 80 is the number of mel bins, and that the paper, the code, and the Hugging Face documentation all describe the result as a mel-spectrogram. What confuses me is the 6274. This is the time dimension, no? But when I run 2 s of 16 kHz audio through the pretrained SpeechT5Processor, I get a mel-spectrogram of shape (B, 126, 80). I would very much appreciate it if someone could tell me what is going on here.

Sincerely, Khalil
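
For reference, the two frame counts in the question can be compared with some quick arithmetic. This is only a sketch: it assumes a hop of 256 samples (16 ms at 16 kHz) and centered framing, which is consistent with the 126 frames reported for 2 s of audio but is not confirmed anywhere in this thread. The helper names `centered_frame_count` and `frames_to_seconds` are made up for illustration and are not part of any library.

```python
SAMPLE_RATE = 16_000  # Hz, as stated in the question
HOP_LENGTH = 256      # samples; ASSUMPTION (16 ms hop), not confirmed in the thread


def centered_frame_count(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frames produced by a centered STFT: one per hop, plus the initial frame."""
    return 1 + num_samples // hop_length


def frames_to_seconds(num_frames: int,
                      hop_length: int = HOP_LENGTH,
                      sample_rate: int = SAMPLE_RATE) -> float:
    """Approximate audio duration covered by a given number of frames."""
    return num_frames * hop_length / sample_rate


# 2 s of 16 kHz audio is 32000 samples.
print(centered_frame_count(2 * SAMPLE_RATE))  # 126, matching the processor output
# Under the same assumed hop, 6274 frames would span roughly 100 s.
print(round(frames_to_seconds(6274), 1))      # 100.4
```

If the assumed hop is right, the 6274 frames from the postnet would correspond to far more than 2 s of audio, so the two outputs are probably not describing the same time span. Whether that comes from the decoder input length or something else is not something this arithmetic can show.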