open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/

Doubt in CE-RVQ loss when training NaturalSpeech 2 #222

Open shreeshailgan opened 3 weeks ago

shreeshailgan commented 3 weeks ago

When training the NS2 model, the CE-RVQ loss is computed by the diff_ce_loss method:
https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_loss.py#L65

This function takes the ground truth indices gt_indices and the predicted distribution pred_dist. For gt_indices, we could pass the loaded code tensor directly. Instead, what is passed is the code reconstructed from the ground truth latent x0:
https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_trainer.py#L464

The ground truth latent x0 is itself inferred earlier from the loaded code tensor:
https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_trainer.py#L436
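
For concreteness, the data flow is roughly the following sketch. The residual quantizer object rvq and the method names codes_to_latent / latent_to_codes are placeholders, not the actual Amphion or EnCodec API:

```python
import torch

def ce_rvq_targets(loaded_code: torch.Tensor, rvq):
    """Illustrative only: `rvq` stands for the codec's residual quantizer;
    the two method names below are hypothetical."""
    # What the trainer effectively does: loaded codes -> latent x0 -> codes again.
    x0 = rvq.codes_to_latent(loaded_code)         # ground truth latent
    reconstructed_code = rvq.latent_to_codes(x0)  # re-quantized indices, used as gt_indices

    # What I am asking about: using the loaded codes directly.
    direct_code = loaded_code

    return reconstructed_code, direct_code
```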

Now, ideally, the reconstructed code should match the loaded ground truth code. In practice, however, I've observed that the codes differ: they only match roughly 25% of the time. This is not a major issue per se, since if you just decode the codes and listen to the wavs, there is no perceptible difference. But even if the reconstruction gave an exact match, why reconstruct at all? Why not pass the original code directly? Am I missing something?
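
A straightforward way to measure the agreement is an element-wise comparison along these lines (code tensors assumed to have shape [B, n_q, T]):

```python
import torch

def code_match_rate(loaded_code: torch.Tensor, reconstructed_code: torch.Tensor) -> float:
    # Fraction of positions where the re-quantized index equals the loaded one.
    assert loaded_code.shape == reconstructed_code.shape
    return (loaded_code == reconstructed_code).float().mean().item()
```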

Thanks.

chazo1994 commented 3 weeks ago

@shreeshailgan How do you generate the codes during preprocessing? If you use the extract_encodec_token function, you have to change the target bandwidth to 12.0 to match the NaturalSpeech 2 model.
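
If you are calling the encodec library directly rather than Amphion's extraction script, the equivalent change is roughly:

```python
from encodec import EncodecModel

# 24 kHz EnCodec model; 12 kbps corresponds to 16 codebooks.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(12.0)
```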

Also, can you share the code for duration preparation for the LibriTTS dataset?

shreeshailgan commented 3 weeks ago

@chazo1994 I am just running the code from encodec's documentation directly to extract the codes, roughly the snippet below. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2.
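
For reference, this is essentially the standard extraction snippet from the encodec README (the audio path is a placeholder; per the comment above, the bandwidth may need to be 12.0 rather than the README's 6.0):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# 24 kHz EnCodec model; the bandwidth determines how many codebooks are produced.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # README value (8 codebooks); 12.0 gives 16 codebooks

# Load the audio and convert it to the model's sample rate and channel count.
wav, sr = torchaudio.load("audio.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add batch dim: [1, channels, T]

# Extract the discrete codes.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)  # [1, n_q, T_code]
```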