Gabibing closed 5 months ago
The W2V representation contains only a little speaker information, so I intend for the prosody embedding to learn additional prosody (pitch) information through the pitch predictor.
If the gradient of the prosody embedding is detached in the pitch predictor, the prosody encoder cannot learn speaker-specific pitch information well.
During training, we already use the ground-truth (gt) w2v representation, so w2v.detach() has no effect for now.
However, for a future model, I plan to train the pitch predictor explicitly, as you suggested.
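To illustrate the gradient argument, here is a minimal PyTorch sketch. The modules `prosody_encoder` and `pitch_predictor`, the tensor shapes, and the L1 loss against a zero target are all hypothetical stand-ins, not the repository's actual code. It only shows that detaching the prosody embedding cuts the pitch loss off from the prosody encoder, while detaching a ground-truth w2v tensor changes nothing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins; not the repository's actual modules.
prosody_encoder = nn.Linear(4, 8)   # produces the prosody embedding
pitch_predictor = nn.Linear(16, 1)  # consumes [w2v, prosody embedding]


def encoder_gets_gradient(detach_prosody: bool) -> bool:
    """Return True if the pitch loss sends a gradient to the prosody encoder."""
    prosody_encoder.zero_grad(set_to_none=True)
    pitch_predictor.zero_grad(set_to_none=True)

    # Ground-truth w2v features are plain data (no grad), so detaching
    # them below does not change the backward pass.
    w2v = torch.randn(2, 8)
    prosody = prosody_encoder(torch.randn(2, 4))
    if detach_prosody:
        prosody = prosody.detach()  # cuts the graph at the embedding

    pred = pitch_predictor(torch.cat([w2v.detach(), prosody], dim=-1))
    loss = (pred - torch.zeros_like(pred)).abs().mean()  # dummy pitch target
    loss.backward()
    return prosody_encoder.weight.grad is not None
```

Calling `encoder_gets_gradient(False)` leaves a gradient on the encoder weights, whereas `encoder_gets_gradient(True)` leaves none. In both cases the pitch predictor itself is still trained.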
Thanks!
The tensors in PitchPredictor are detached to ensure they do not influence the TTV result.