rishikksh20 / iSTFTNet-pytorch

iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform
Apache License 2.0

Predicted phase not in range [-pi .. pi], but in range [-1 .. 1] #16

Open kgoba opened 1 year ago

kgoba commented 1 year ago

The phase output of the generator can currently only range from -1 to 1, which is not enough, since the full phase in radians is expected later by stft.inverse() (either 0..2*pi or -pi..pi).
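To illustrate where the bounded phase ends up, here is a minimal sketch using torch.istft as a stand-in for the repo's stft.inverse(); the shapes are made up purely for illustration:

```python
import torch

# Shapes here are made up purely for illustration.
n_fft, hop = 16, 4
mag = torch.rand(1, n_fft // 2 + 1, 10)     # predicted magnitude spectrogram
raw = torch.randn(1, n_fft // 2 + 1, 10)    # raw generator output for the phase branch
phase = torch.sin(raw)                      # sine activation -> bounded to [-1, 1]

# The inverse STFT interprets the phase as an angle in radians, so it should
# be able to take any value in [-pi, pi]; with a plain sine activation only a
# narrow arc of the unit circle is reachable.
spec = mag * torch.exp(1j * phase)
audio = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                    window=torch.hann_window(n_fft))
```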

The paper mentions somewhat cryptically that "we apply a sine activation function to represent the periodic characteristics of the phase spectrogram", but in any case the current implementation is faulty, since it cannot represent the full range of possible phases.

https://github.com/rishikksh20/iSTFTNet-pytorch/blob/ecbf0f635b36432bd3e432790326591bc86cadbc/models.py#L118

As a suggestion, either try scaling the output by 2*pi, or directly predict sin(phase) and cos(phase) in the generator (the predicted values can be normalized by dividing both by sqrt(sin(phase)**2 + cos(phase)**2)). A rough sketch of both options is below.
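For concreteness, a sketch of the two options (the function names are just for illustration, not from the repo):

```python
import torch

def phase_scaled(x):
    # Option A (sketch): keep the sine activation but scale it so the
    # predicted phase spans the full [-pi, pi] range.
    return torch.pi * torch.sin(x)

def phase_from_sin_cos(sin_raw, cos_raw, eps=1e-8):
    # Option B (sketch): have the generator predict a (sin, cos) pair,
    # project it onto the unit circle, and recover the angle with atan2.
    norm = torch.sqrt(sin_raw ** 2 + cos_raw ** 2 + eps)
    s, c = sin_raw / norm, cos_raw / norm
    return torch.atan2(s, c)  # always in [-pi, pi]
```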

SynthAether commented 1 year ago

This is a good point, and I have looked into it in my own implementation of iSTFTNet, which initially output [-pi:+pi]; I later switched it to produce the output phase in [-1:+1]. In my case, using pi did not make the synthesis sound better; in fact, I noticed a small degradation compared to [-1:+1], but that could have been random luck with training. This was very puzzling. I even went as far as making a trainable scaler so the network would learn the optimal value, which in my case stabilized at [-2.5:+2.5], but again it was hard to hear any improvement. I should stress that this was tested on a different but similar implementation, so I don't know how it applies to Rishikesh's excellent implementation.
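For reference, a minimal sketch of the trainable scaler idea described above (the module name and initial value are hypothetical):

```python
import torch
from torch import nn

class ScaledPhase(nn.Module):
    # Hypothetical module: a sketch of the trainable phase scaler described above.
    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The phase is bounded to [-scale, +scale]; the scale is learned
        # jointly with the generator (in my case it settled around 2.5).
        return self.scale * torch.sin(x)
```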

yl4579 commented 1 year ago

This is insanely weird. I tried to train it by multiplying the phase by torch.pi, but it fails to converge, while using the range from -1 to 1 works very well; I could obtain human-level quality on LJSpeech when combined with AdaIN and Snake activation functions for StyleTTS 2. I have no explanation for why this happens; it makes no sense to me. If anyone has come up with a reason, please let us know.