I am looking at the model code in two folders, facodec and ns3_facodec. I know that ns3_facodec is the training code for FACodec. However, I see several differences between the two architectures:
First, there are no LSTMs in the official FACodec, in either the encoder or the decoder.
Second, the timbre encoder is somewhat different. Although both use a Transformer, the two implementations are not the same.
The generator loss is a weighted combination of multiple losses. However, when I compare it with the appendix of the NaturalSpeech3 paper, the weights clearly do not match the paper; they look like the ones from the DAC paper instead.
The upsample and downsample rates are not the same. For the official Ns3_codec they are [2, 4, 5, 5], while for the other one they are [2, 4, 8, 8]. This also means the hop_lengths for the mel spectrogram are 200 and 300, respectively.
In the training code, the audio data has a sampling rate of 24 kHz, while the original paper works on 16 kHz audio.
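As a quick sanity check on the rates and sampling frequencies above, here is a minimal sketch. It assumes the hop length is simply the product of the per-stage rates, which is the usual convention for strided convolutional codecs but is not confirmed against either folder's code. The helper names `hop_length` and `frame_rate` are made up for illustration and are not part of Amphion.

```python
from math import prod

def hop_length(rates):
    # Assumption: total hop length = product of the per-stage
    # downsample (or upsample) rates.
    return prod(rates)

def frame_rate(sample_rate, rates):
    # Latent frames per second for a given input sampling rate.
    return sample_rate / hop_length(rates)

rate_sets = {
    "[2, 4, 5, 5]": [2, 4, 5, 5],
    "[2, 4, 8, 8]": [2, 4, 8, 8],
}

for name, rates in rate_sets.items():
    for sr in (16000, 24000):
        print(f"{name} @ {sr} Hz: hop={hop_length(rates)}, "
              f"frames/s={frame_rate(sr, rates)}")
```

Under that assumption, [2, 4, 5, 5] gives a hop length of 200, while [2, 4, 8, 8] gives 512 rather than 300, so either the rates or the hop length for the second configuration may be worth double-checking against the actual config file.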