[Open] kgoba opened this issue 1 year ago
The phase output of the generator can currently only range from -1 to 1, which is not enough, since the full phase in radians (either 0..2*pi or -pi..pi) is expected later in stft.inverse(). The paper mentions, somewhat cryptically, that "we apply a sine activation function to represent the periodic characteristics of the phase spectrogram", but in any case the current implementation is faulty, since it cannot represent the full range of possible phases:

https://github.com/rishikksh20/iSTFTNet-pytorch/blob/ecbf0f635b36432bd3e432790326591bc86cadbc/models.py#L118
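A toy illustration of the problem (not code from this repo; it only assumes the inverse STFT interprets the prediction as an angle in radians):

```python
import torch

# The inverse STFT rebuilds the complex spectrogram roughly as
#   spec = magnitude * exp(1j * phase),
# i.e. the predicted phase is an angle in radians. A head bounded to
# [-1, 1] can therefore never reach angles in (1, pi] or [-pi, -1).
true_phase = torch.tensor(3.0)           # a perfectly legitimate phase (~172 deg)
clipped = true_phase.clamp(-1.0, 1.0)    # the closest a [-1, 1] head can get
err = torch.abs(torch.exp(1j * true_phase) - torch.exp(1j * clipped))
print(err)  # ~1.68, where 2.0 is the worst possible error between unit vectors
```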
As a suggestion, either try scaling the output by 2*pi, or directly predict sin(phase) and cos(phase) in the generator (the predicted pair can be normalized onto the unit circle by dividing both values by sqrt(sin(phase)**2 + cos(phase)**2)).
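A minimal sketch of the sin/cos variant; the module name and shapes here are hypothetical, not part of this repo:

```python
import torch
import torch.nn as nn

class SinCosPhaseHead(nn.Module):
    """Hypothetical phase head: predicts (sin, cos) pairs instead of the
    angle itself, so every phase in [-pi, pi] becomes representable."""

    def __init__(self, channels, n_freq):
        super().__init__()
        # two outputs per frequency bin: unnormalized sin and cos
        self.proj = nn.Conv1d(channels, 2 * n_freq, kernel_size=7, padding=3)

    def forward(self, x):  # x: (batch, channels, frames)
        s, c = self.proj(x).chunk(2, dim=1)
        # project (s, c) onto the unit circle: divide by sqrt(s^2 + c^2)
        norm = torch.sqrt(s ** 2 + c ** 2 + 1e-8)
        s, c = s / norm, c / norm
        # atan2 recovers a proper angle in [-pi, pi] for stft.inverse()
        return torch.atan2(s, c)
```

Alternatively, the atan2 can be skipped entirely by forming the complex spectrogram directly as magnitude * (c + 1j * s), which sidesteps angle wrapping altogether.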
This is a good point, and I have looked into it in my own implementation of iSTFTNet, which initially output [-pi, +pi]; I then dropped the pi scaling so that it outputs phase in [-1, +1]. In my case, using pi did not make the synthesis sound better. In fact, I noticed a small degradation compared to [-1, +1], but that could have been random luck with training. This was very puzzling. I even went as far as making the scale trainable so the network would learn the optimal value, which in my case stabilized at [-2.5, +2.5], but again it was hard to hear any improvement. I should stress that this was tested on a different but similar implementation; I don't know how it applies to Rishikesh's excellent implementation.
This is insanely weird. I have tried training with the phase multiplied by torch.pi, but it fails to converge, while the -1 to 1 range works very well: I could obtain human-level quality on LJSpeech when combining it with AdaIN and Snake activation functions for StyleTTS 2. I have no explanation for why this happens; it makes no sense to me. If anyone has come up with a reason, please let us know.