open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/

Doubt in CE-RVQ loss when training NaturalSpeech 2 #222

Open shreeshailgan opened 3 weeks ago

shreeshailgan commented 3 weeks ago

When training the NS2 model, the CE-RVQ loss is computed by the diff_ce_loss method:
https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_loss.py#L65

This function takes the ground truth indices gt_indices and the predicted distribution pred_dist. For gt_indices, we could pass the loaded code tensor directly. Instead, what is passed is the code reconstructed from the ground truth latent x0:
https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_trainer.py#L464

The ground truth latent x0 is itself inferred earlier from the loaded code tensor:
https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_trainer.py#L436
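
For concreteness, the data flow is roughly the following sketch. The residual quantizer object rvq and the method names codes_to_latent / latent_to_codes are placeholders, not the actual Amphion or EnCodec API:

```python
import torch

def ce_rvq_targets(loaded_code: torch.Tensor, rvq):
    """Illustrative only: `rvq` stands for the codec's residual quantizer;
    the two method names below are hypothetical."""
    # What the trainer effectively does: loaded codes -> latent x0 -> codes again.
    x0 = rvq.codes_to_latent(loaded_code)         # ground truth latent
    reconstructed_code = rvq.latent_to_codes(x0)  # re-quantized indices, used as gt_indices

    # What I am asking about: using the loaded codes directly.
    direct_code = loaded_code

    return reconstructed_code, direct_code
```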

Now, ideally, the reconstructed code should match the loaded ground truth code. In practice, however, I've observed that the codes differ: they only match roughly 25% of the time. This is not a major issue per se, since if you just decode the codes and listen to the wavs, there is no perceptible difference. But even if the reconstruction gave an exact match, why reconstruct at all? Why not pass the original code directly? Am I missing something?
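
A straightforward way to measure the agreement is an element-wise comparison along these lines (code tensors assumed to have shape [B, n_q, T]):

```python
import torch

def code_match_rate(loaded_code: torch.Tensor, reconstructed_code: torch.Tensor) -> float:
    # Fraction of positions where the re-quantized index equals the loaded one.
    assert loaded_code.shape == reconstructed_code.shape
    return (loaded_code == reconstructed_code).float().mean().item()
```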

Thanks.

chazo1994 commented 3 weeks ago

@shreeshailgan How do you generate the codes during preprocessing? If you use the extract_encodec_token function, you have to change the target bandwidth to 12.0 to match the NaturalSpeech 2 model.
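
If you are calling the encodec library directly rather than Amphion's extraction script, the equivalent change is roughly:

```python
from encodec import EncodecModel

# 24 kHz EnCodec model; 12 kbps corresponds to 16 codebooks.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(12.0)
```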

Also, can you share the code for duration preparation for the LibriTTS dataset?

shreeshailgan commented 3 weeks ago

@chazo1994 I am just running the code from encodec's documentation directly to extract the codes, roughly the snippet below. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2.
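
For reference, this is essentially the standard extraction snippet from the encodec README (the audio path is a placeholder; per the comment above, the bandwidth may need to be 12.0 rather than the README's 6.0):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# 24 kHz EnCodec model; the bandwidth determines how many codebooks are produced.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # README value (8 codebooks); 12.0 gives 16 codebooks

# Load the audio and convert it to the model's sample rate and channel count.
wav, sr = torchaudio.load("audio.wav")  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add batch dim: [1, channels, T]

# Extract the discrete codes.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)  # [1, n_q, T_code]
```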