microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License
1.09k stars, 113 forks

Voice Conversion - Error with Some Mono, 16kHz, 16bit Audio #58

Open fabiocat93 opened 11 months ago

fabiocat93 commented 11 months ago

I am working through the voice conversion tutorial (https://huggingface.co/blog/speecht5) to convert input audio into a target voice, and everything works fine. Next, I tried the code on my own data. All files are mono, 16 kHz, 16-bit. Most of them work, but for some of them I get the following error:

File "/<PATH_TO_MY_CONDA>/envs/fab/lib/python3.9/site-packages/transformers/models/speecht5/modeling_speecht5.py", line 456, in forward
    emb = emb + self.alpha * self.pe[:, : emb.size(1)]
RuntimeError: The size of tensor a (1877) must match the size of tensor b (1876) at non-singleton dimension 1

Has anybody faced anything similar?

kevinjcai commented 7 months ago

I'm running into this issue too

kevinjcai commented 7 months ago

I think the error might be that your embedding is longer than the positional encoding (1877 vs. 1876 positions along dimension 1, going by the error message).
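To make the size mismatch concrete, here is a minimal sketch of what a slice like `self.pe[:, : emb.size(1)]` does when the sequence is longer than the stored positional encoding. It uses plain Python lists as a stand-in for tensors; the helper name `sliced_pe_length` is hypothetical, and the buffer length of 1876 is only inferred from the error message above, not read from the model config.

```python
# Stand-in for the positional-encoding buffer. The length 1876 is an
# assumption taken from "tensor b (1876)" in the error message.
PE_LEN = 1876


def sliced_pe_length(seq_len: int, pe_len: int = PE_LEN) -> int:
    """Return the length of pe[:seq_len] when pe holds pe_len positions."""
    pe = list(range(pe_len))
    # Slicing past the end does not raise; it silently returns what exists.
    return len(pe[:seq_len])


for seq_len in (1000, 1876, 1877):
    got = sliced_pe_length(seq_len)
    status = "ok" if got == seq_len else "mismatch"
    print(f"sequence length {seq_len}: slice length {got} ({status})")
```

The slice never fails by itself; it just comes back one position short once the sequence reaches 1877, and the subsequent addition is what reports the mismatch.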

zhenhaoge commented 3 weeks ago

I am running into this issue too. If I change the input audio but keep the embedding the same, the issue goes away, so it seems to be related to the input audio rather than the embedding. Has anyone fixed it?
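If the failure does depend on the input audio, one thing that might be worth trying is converting long recordings in shorter pieces. The sketch below is an untested idea, not a confirmed fix: `split_into_chunks` is a hypothetical helper, the 2-second chunk size is an arbitrary assumption, and nothing in this thread establishes how input length maps to the 1876-position limit.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # matches the 16 kHz audio described in this thread


def split_into_chunks(samples: Sequence[float], max_samples: int) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive chunks of at most max_samples."""
    if max_samples <= 0:
        raise ValueError("max_samples must be positive")
    return [samples[i:i + max_samples] for i in range(0, len(samples), max_samples)]


# Five seconds of silence as placeholder data (not real audio).
waveform = [0.0] * (SAMPLE_RATE * 5)

# The chunk size is an assumption; tune it for your own files.
chunks = split_into_chunks(waveform, max_samples=SAMPLE_RATE * 2)
print([len(chunk) for chunk in chunks])  # [32000, 32000, 16000]
```

Each chunk would then be passed through the conversion separately and the outputs concatenated. Cutting at fixed sample offsets can split words, so cutting at silences would likely sound better.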