open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
4.28k stars 365 forks

[BUG]: The lengths of the features from FACodecEncoderV2 do not match #188

Open Mahaotian1 opened 2 months ago

Mahaotian1 commented 2 months ago

bug of FACodecEncoderV2

I extracted prosody_feature and encoder_output from FACodecEncoderV2. An error is raised when I pass them to fa_decoder_v2 to extract the VQ codes, because the lengths of prosody_feature (torch.Size([1, 20, 281])) and encoder_output (torch.Size([1, 256, 282])) are not the same.

my code

wav_b = librosa.load(wav_b, sr=16000)[0]
wav_b = torch.from_numpy(wav_b).float()
wav_b = wav_b.unsqueeze(0).unsqueeze(0)
enc_out_b = fa_encoder_v2(wav_b)
prosody_b = fa_encoder_v2.get_prosody_feature(wav_b)
vq_post_emb_b, vq_id_b, _, quantized, spk_embs_b = fa_decoder_v2(
    enc_out_b, prosody_b, eval_vq=False, vq=True
)

bug

File "/home/data/mahaotian/Amphion/models/codec/ns3_codec/inference_codc.py", line 129, in <module>
    vq_post_emb_a, vq_id_a, _, quantized, spk_embs_a = fa_decoder_v2(
File "/home/data/mahaotian/anaconda3/envs/vallex/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/home/data/mahaotian/anaconda3/envs/vallex/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
File "/home/data/mahaotian/Amphion/models/codec/ns3_codec/facodec.py", line 1086, in forward
    outs, qs, commit_loss, quantized_buf = self.quantize(
File "/home/data/mahaotian/Amphion/models/codec/ns3_codec/facodec.py", line 1048, in quantize
    outs += out
RuntimeError: The size of tensor a (281) must match the size of tensor b (282) at non-singleton dimension 2

HeCheng0625 commented 2 months ago

Hi, you need to pad your wav so that its length is a multiple of 200 (the hop size).
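
A minimal sketch of that padding step, assuming that right-padding with zeros (silence) is acceptable for your use case. The helper names `padded_length` and `pad_to_hop_multiple` are hypothetical and not part of Amphion; only the hop size of 200 comes from the reply above.

```python
HOP_SIZE = 200  # hop size mentioned in the reply above


def padded_length(num_samples, hop_size=HOP_SIZE):
    """Return the smallest multiple of hop_size that is >= num_samples."""
    # Integer ceiling division, so no floating-point rounding is involved.
    return -(-num_samples // hop_size) * hop_size


def pad_to_hop_multiple(samples, hop_size=HOP_SIZE):
    """Right-pad a 1-D sequence of samples with zeros (hypothetical helper)."""
    samples = list(samples)
    missing = padded_length(len(samples), hop_size) - len(samples)
    return samples + [0.0] * missing


if __name__ == "__main__":
    wav = [0.1] * 450  # stand-in for a loaded waveform
    padded = pad_to_hop_multiple(wav)
    print(len(wav), "->", len(padded))
```

For a torch tensor shaped like wav_b in the snippet above, the same amount of padding can be applied to the last dimension with torch.nn.functional.pad(wav_b, (0, missing)) before calling the encoder, so that both feature extractors see an input whose length is a multiple of the hop size.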