vvvm23 / specgrad

An unofficial (work-in-progress) implementation of "SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping" in PyTorch
MIT License

minimum phase response #1

Open · pranavmalikk opened this issue 1 year ago

pranavmalikk commented 1 year ago

I was looking through the repo but I couldn't find any information about the minimum phase response. I'm having some trouble implementing it in PyTorch, as the closest interpretation is in SciPy; have you made progress on this? The paper says the following but doesn't go into much depth:

"For computation of the T-F domain filter M, we estimate the spectral envelope via cepstrum as follows. First, pseudoinverse of the mel-compression matrix is applied to c for computing the corresponding power spectrogram. Then, the rth order lifter is applied to compute the spectral envelope for each time frame. As with PriorGrad, to ensure numerical stability during training [24], we add 0.01 to the estimated envelope. The coefficients m1,k, . . . , mN,k for the kth time frame are computed from the obtained envelope with minimum phase response"

pranavmalikk commented 1 year ago

I've done a naive implementation of the minimum phase response here:

import torch

def minimum_phase(magnitude_spectrum):
    # Real cepstrum of the log-magnitude spectrum
    n = magnitude_spectrum.shape[-1]
    cepstrum = torch.fft.ifft(torch.log(torch.abs(magnitude_spectrum))).real

    # Fold the cepstrum to make it causal: keep c[0] (and c[n/2] for even n),
    # double the rest of the first half, zero the second half
    window = torch.zeros(n)
    window[0] = 1.0
    window[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        window[n // 2] = 1.0
    minimum_phase_cepstrum = cepstrum * window

    # Back to the frequency domain: the exponential of the FFT gives the minimum-phase spectrum
    minimum_phase_spectrum = torch.exp(torch.fft.fft(minimum_phase_cepstrum))

    return minimum_phase_spectrum
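As a quick sanity check on the folding (a sketch, not something from the repo): a minimum-phase spectrum built from the full, conjugate-symmetric magnitude spectrum of a real signal should preserve that magnitude up to numerical error, with only the phase changing.

import torch

x = torch.randn(1024)
mag = torch.abs(torch.fft.fft(x)) + 0.01        # symmetric magnitude spectrum, offset for the log
h = minimum_phase(mag)
print(torch.allclose(h.abs(), mag, atol=1e-3))  # expected: True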

I'm executing this after adding the epsilon of 0.01 to the envelope (normalised_spec in your calculate_tf_filter method). My next question is about transform_noise: I saw in your implementation and notebook that it is not what you expect, so what are you expecting it to look like?

vvvm23 commented 1 year ago

> I saw in your implementation and notebook that it is not what you expect, so what are you expecting it to look like?

Frankly, I can't remember. I haven't worked on this for a while, and the current implementation does not produce good results. I've been meaning to come back to it.

I think there is some difference between how torch and scipy handle FFTs. I think the scipy version is more faithful to the original paper, which is why my torch version (apparently) produced different results from what past me expected.

Relatedly, I had emailed the authors about the minimum phase response, as my implementation as it stands doesn't use it (we use a zero-phase filter). They said that using minimum phase would NOT significantly affect the final result.
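For context, a rough sketch of what that difference amounts to (shape_noise and its parameters are illustrative, not the repo's API): the noise is filtered in the STFT domain, and a zero-phase filter multiplies by the real envelope while a minimum-phase filter multiplies by the complex spectrum from a construction like minimum_phase above. Both have the same magnitude response, which is presumably why the authors said the choice matters little.

import torch

def shape_noise(noise, filt, n_fft=1024, hop_length=256):
    # noise: (samples,); filt: (n_freq, frames), real-valued for a zero-phase filter,
    # complex-valued for a minimum-phase filter (same magnitude in both cases)
    window = torch.hann_window(n_fft)
    spec = torch.stft(noise, n_fft, hop_length=hop_length, window=window, return_complex=True)
    frames = min(spec.shape[-1], filt.shape[-1])
    shaped = spec[..., :frames] * filt[..., :frames]
    return torch.istft(shaped, n_fft, hop_length=hop_length, window=window)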

vvvm23 commented 1 year ago

One approach I was considering was rewriting this in JAX, which I suspect is what the original authors used. Then I would have access to accurate, hardware-accelerated versions of those functions.

However, reading the email again, one exception is that they used in-house code for the mel-inverse. So who knows what the correct result is...

pranavmalikk commented 1 year ago

When you're talking about the mel-inverse, are you referring to "pseudoinverse of the mel-compression matrix is applied to c for computing the corresponding power spectrogram" or "G+ is the matrix representation of the inverse STFT (iSTFT) using a dual window"?

vvvm23 commented 1 year ago

The first one!

pranavmalikk commented 1 year ago

Yes, it seems that applying the pseudoinverse of the mel basis to the mel spectrogram (pinv_mel_basis @ mel) results in something invalid:

[screenshot]

Unless this is an expected result, I'm debugging this and hope to find a solution soon
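One guess at the failure mode (an assumption, not a confirmed fix): the Moore-Penrose pseudoinverse is not constrained to be non-negative, so pinv_mel_basis @ mel can contain negative entries, which turn into NaNs once a log is taken. Clamping to a small positive floor before the log sidesteps that:

power = torch.clamp(pinv_mel_basis @ mel, min=1e-8)  # floor out negative pseudoinverse artefacts
log_power = torch.log(power)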

pranavmalikk commented 1 year ago

OK, I was able to get some preliminary results after debugging the noise, and I'm wondering if I've made progress. I have the following results now:

[screenshot: taking the torch log of the mel spectrogram, log_mel = torch.log(mel_spectrogram + 1e-6)]

[screenshot]

Here I've changed the input to the transform_noise function to be the waveform itself, and after I retrieve that value I multiply it by the noise:

transformed_noise = transform_noise(M, waveform, n_fft=params.n_fft, window_length=params.window_length, hop_length=params.hop_length, post_norm=True)
transformed_noise = transformed_noise * first_noise

Also, as you can see, the scale is larger (which I believe helps with narrowing down the mel spectrogram in the denoising steps).

Either way, I'm converting to the log-mel spectrogram, and I believe this is the right direction, as they mention in the paper:

[screenshot: excerpt from the paper]

The noise is shown at https://wavegrad.github.io/specgrad/#animation, so I believe the second approach is correct.