open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

length mismatch for FACodecDecoderV2 #160

Closed chenjiasheng closed 5 months ago

chenjiasheng commented 5 months ago

https://github.com/open-mmlab/Amphion/blob/58dc8707dec735fdb381d351fc123bec9242b204/models/codec/ns3_codec/facodec.py#L1048

It raises when the input x has shape torch.Size([1, 256, 583]).

For V2, the prosody encoder's input is a mel spectrogram, while the other encoders' input is still the waveform. Some padding or cutting is needed to ensure the two outputs have the same length.

Exception has occurred: RuntimeError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
The size of tensor a (582) must match the size of tensor b (583) at non-singleton dimension 2
  File "/home/chenjiasheng/code/amphion/Amphion/models/codec/ns3_codec/facodec.py", line 1048, in quantize
    outs += out
  File "/home/chenjiasheng/code/amphion/Amphion/models/codec/ns3_codec/facodec.py", line 1086, in forward
    outs, qs, commit_loss, quantized_buf = self.quantize(
  File "/mfa_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mfa_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chenjiasheng/code/amphion/test_facodec_v2.py", line 58, in <module>
    vq_post_emb_b, vq_id_b, _, quantized, spk_embs_b = fa_decoder_v2(
  File "/mfa_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mfa_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
RuntimeError: The size of tensor a (582) must match the size of tensor b (583) at non-singleton dimension 2

HeCheng0625 commented 5 months ago

Hi, you can pad the waveform so that its length is a multiple of 200 (the hop length) before inference.
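
A minimal sketch of that padding step, in case it helps. The hop length of 200 comes from the reply above; the helper names `pad_amount` and `padded_length` are hypothetical and not part of Amphion, and zero-padding on the right (shown in the trailing comment) is an assumption about where the padding should go.

```python
def pad_amount(num_samples: int, hop_length: int = 200) -> int:
    """Return how many samples to append so the length is a multiple of hop_length.

    Hypothetical helper for illustration; not part of Amphion.
    """
    if hop_length <= 0:
        raise ValueError("hop_length must be positive")
    # Python's modulo of a negative number is non-negative, so this yields
    # 0 when num_samples is already a multiple of hop_length.
    return (-num_samples) % hop_length


def padded_length(num_samples: int, hop_length: int = 200) -> int:
    """Return the waveform length after padding up to the next multiple."""
    return num_samples + pad_amount(num_samples, hop_length)


# Applying it with PyTorch, assuming `wav` is a tensor whose last
# dimension is time (zero-padding on the right is an assumption):
#   import torch.nn.functional as F
#   wav = F.pad(wav, (0, pad_amount(wav.shape[-1])))
```

With the padded waveform, the mel-based prosody branch and the waveform-based branches should see inputs covering the same whole number of hops, which is presumably why the off-by-one frame disappears.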