Voice Conversion - Error with Some Mono, 16kHz, 16bit Audio

fabiocat93 commented 11 months ago

I worked through the voice conversion tutorial (https://huggingface.co/blog/speecht5) to convert some audio input into a target voice, and everything works fine. Next, I tried the code on my own data. All files are mono, 16 kHz, 16-bit. Most of them work fine, but for some of them I get the following error:

File "/<PATH_TO_MY_CONDA>/envs/fab/lib/python3.9/site-packages/transformers/models/speecht5/modeling_speecht5.py", line 456, in forward
    emb = emb + self.alpha * self.pe[:, : emb.size(1)]
RuntimeError: The size of tensor a (1877) must match the size of tensor b (1876) at non-singleton dimension 1

Has anybody faced anything similar?

kevinjcai commented 7 months ago

I'm running into this issue too

kevinjcai commented 7 months ago

I think the error might be that your embedding is larger than the positional encoding
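To illustrate that guess, here is a minimal pure-Python sketch of how `emb + pe[:, : emb.size(1)]` could end up with mismatched sizes: slicing past the end of a table does not fail, it silently returns fewer items, so the addition is what breaks. The table length of 1876 is an assumption read off the error message, and `add_positions` is a hypothetical stand-in, not the actual transformers code.

```python
MAX_POSITIONS = 1876  # assumed table length, taken from the error message


def add_positions(emb, pe):
    """Mimic `emb + pe[:len(emb)]` on flat lists, with a strict length check."""
    sliced = pe[: len(emb)]  # slicing past the end silently truncates
    if len(sliced) != len(emb):
        raise ValueError(f"size {len(emb)} must match size {len(sliced)}")
    return [e + p for e, p in zip(emb, sliced)]


pe = [0.0] * MAX_POSITIONS

# A sequence that fits inside the table is fine.
ok = add_positions([1.0] * 1876, pe)

# One extra position reproduces the same kind of mismatch as in the traceback.
try:
    add_positions([1.0] * 1877, pe)
    failed = False
    message = ""
except ValueError as err:
    failed = True
    message = str(err)
```

If this is what is happening, any sequence longer than the table by even one step would trigger the error, which would match the 1877 vs. 1876 in the traceback.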

zhenhaoge commented 3 weeks ago

I am running into this issue too. If I change the input audio, but keep the embedding the same, this issue is gone. So it is related to the input audio, rather than the embedding. Did anyone fix it?
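If the failure depends on the input audio, clip length is one thing worth checking. Below is a rough sketch, assuming (not confirmed in this thread) that longer clips are the trigger, which splits a 16 kHz waveform into shorter chunks that could each be converted separately. `split_waveform` and the 10-second limit are hypothetical choices for illustration, not part of SpeechT5 or transformers.

```python
SAMPLE_RATE = 16_000  # the audio in this thread is 16 kHz mono


def split_waveform(samples, max_seconds=10.0, sample_rate=SAMPLE_RATE):
    """Split a 1-D sequence of samples into chunks of at most `max_seconds`."""
    chunk_size = int(max_seconds * sample_rate)
    if chunk_size <= 0:
        raise ValueError("max_seconds * sample_rate must be at least one sample")
    # Works for anything that supports len() and slicing (list, NumPy array, ...).
    return [samples[i : i + chunk_size] for i in range(0, len(samples), chunk_size)]


# Example: a 25-second clip of silence becomes 10 s + 10 s + 5 s.
waveform = [0.0] * (25 * SAMPLE_RATE)
chunks = split_waveform(waveform)
chunk_lengths = [len(c) for c in chunks]
```

Each chunk could then be passed through the conversion call on its own and the outputs concatenated. Whether this actually avoids the error has not been tested here, and hard cuts may produce audible seams at the chunk boundaries.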

microsoft / SpeechT5

Voice Conversion - Error with Some Mono, 16kHz, 16bit Audio #58