shreeshailgan opened 3 weeks ago
@shreeshailgan How do you generate the codes during preprocessing? If you use the `extract_encodec_token` function, you have to change the target bandwidth to 12.0 to match the NaturalSpeech 2 model.
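For what it's worth, here is my back-of-the-envelope arithmetic for why the target bandwidth is 12.0 kbps. It assumes the 24 kHz Encodec model (75 frames per second, 1024-entry codebooks), so treat the numbers as my own reading rather than anything from the NS2 authors:

```python
import math

# Each Encodec RVQ level contributes log2(codebook_size) bits per frame.
frame_rate = 75                               # frames/s for the 24 kHz model
codebook_size = 1024                          # entries per RVQ codebook
bits_per_code = math.log2(codebook_size)      # 10 bits per code
bw_per_quantizer = frame_rate * bits_per_code # 750 bps per RVQ level

target_bw = 12_000                            # 12.0 kbps
n_quantizers = int(target_bw // bw_per_quantizer)
print(n_quantizers)                           # -> 16 RVQ levels
```

So 12.0 kbps corresponds to 16 RVQ levels, which is (as I understand it) the number of codebooks the NS2 model expects; a lower bandwidth would give you fewer code levels than the model was configured for.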
Yeah, and can you share the code for duration preparation for the LibriTTS dataset?
@chazo1994 I am just directly running the code from Encodec's documentation to extract the codes. For duration preparation, you can take a look at the preprocessing script of FastSpeech 2.
When training the NS2 model, the CE-RVQ loss is calculated with the `diff_ce_loss` method: https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_loss.py#L65

This function takes the ground-truth indices `gt_indices` and the predicted distribution `pred_dist`. For `gt_indices`, we could pass the loaded `code` tensor directly. Instead, what is being passed is the code reconstructed from the ground-truth latent `x0`: https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_trainer.py#L464

The ground-truth latent `x0` is itself inferred earlier from the loaded `code` tensor: https://github.com/open-mmlab/Amphion/blob/d33551476d792e608c13cec1bfa32283c868a2fb/models/tts/naturalspeech2/ns2_trainer.py#L436

Now, ideally, the reconstructed code should match the loaded ground-truth code. In practice, however, I've observed that the codes differ: they only match roughly 25% of the time. This is not a major issue per se, since if you just decode the codes and listen to the wavs there is no perceptible difference. But even if the reconstruction gave an exact match, what is the need to reconstruct? Why not pass the original `code` directly? Am I missing something?

Thanks.
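On the ~25% match rate: I think this is expected behavior for residual VQ, not a bug. Re-encoding the decoded latent `x0` quantizes the *sum* of all levels' codewords, and the level-1 search over that sum need not pick the original level-1 codeword. Here's a toy, dependency-free sketch with random (untrained) codebooks — not Encodec's actual quantizer, just an illustration of the round-trip:

```python
import random
random.seed(0)

def nearest(vec, codebook):
    # index of the codeword with the smallest squared distance to vec
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def rvq_encode(x, codebooks):
    """Residual VQ: each level quantizes the residual left by the previous one."""
    indices, residual = [], list(x)
    for cb in codebooks:
        i = nearest(residual, cb)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def rvq_decode(indices, codebooks):
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

# toy setup: 3 RVQ levels, 8 codewords each, 4-dim vectors
codebooks = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
             for _ in range(3)]
vecs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(50)]

matches = 0
for v in vecs:
    codes = rvq_encode(v, codebooks)
    x0 = rvq_decode(codes, codebooks)        # "ground-truth latent" from codes
    if rvq_encode(x0, codebooks) == codes:   # re-derive codes from the latent
        matches += 1
print(matches / len(vecs))  # often below 1.0: the round-trip is not exact
```

If this reading is right, passing the loaded `code` tensor directly would indeed sidestep the mismatch, so the question of why the trainer reconstructs it still stands.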