fix: detach the tensors in PitchPredictor

The W2V representation contains only a little speaker information, so I intend the prosody embedding to learn additional prosody (pitch) information through the pitch predictor.

If the gradient of the prosody embedding is detached in the pitch predictor, the prosody encoder cannot learn the speaker-specific pitch information well.

During training, we already use a ground-truth (GT) w2v representation, so adding w2v.detach() changes nothing for now.
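To illustrate the two points above, here is a minimal PyTorch sketch. It is not the HierSpeech++ code: `prosody_encoder`, `pitch_predictor`, the tensor shapes, and the L1 loss are hypothetical stand-ins chosen only to show how `detach()` affects gradient flow.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for illustration only (not the real modules).
prosody_encoder = nn.Linear(80, 16)       # mel frame -> prosody embedding
pitch_predictor = nn.Linear(32 + 16, 1)   # [w2v, prosody] -> predicted F0

mel = torch.randn(4, 80)
w2v = torch.randn(4, 32)        # ground-truth feature: plain input, requires_grad=False
f0_target = torch.randn(4, 1)


def encoder_grad(detach_prosody: bool, detach_w2v: bool):
    """Return the gradient that the pitch loss sends to the prosody encoder."""
    prosody_encoder.zero_grad(set_to_none=True)
    pitch_predictor.zero_grad(set_to_none=True)

    prosody = prosody_encoder(mel)
    if detach_prosody:
        prosody = prosody.detach()   # cuts the path back to the encoder
    feats = w2v.detach() if detach_w2v else w2v

    pred = pitch_predictor(torch.cat([feats, prosody], dim=-1))
    loss = nn.functional.l1_loss(pred, f0_target)
    loss.backward()

    grad = prosody_encoder.weight.grad
    return None if grad is None else grad.clone()


g_attached = encoder_grad(detach_prosody=False, detach_w2v=False)
g_w2v_detached = encoder_grad(detach_prosody=False, detach_w2v=True)
g_prosody_detached = encoder_grad(detach_prosody=True, detach_w2v=False)

print("prosody attached, encoder gets a gradient:", g_attached is not None)
print("w2v.detach() leaves that gradient unchanged:",
      torch.allclose(g_attached, g_w2v_detached))
print("prosody detached, encoder gets no gradient:", g_prosody_detached is None)
```

In this sketch, detaching the prosody embedding leaves the encoder without any gradient from the pitch loss, while detaching an input that already carries no gradient (the GT w2v feature) makes no difference.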

However, for a future model, I plan to train the pitch predictor explicitly, as you suggested.

Thanks!

sh-lee-prml / HierSpeechpp

fix: detach the tensors in PitchPredictor #26