jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License
1.92k stars 506 forks

How to draw the "Pixel-wise difference in the mel-spectrogram domain" picture? #96

Closed JohnHerry closed 3 years ago

JohnHerry commented 3 years ago

In the HiFi-GAN paper, Figure 3 shows the difference between the mel-spectrogram of the generated waveform and the Tacotron2-generated mel-spectrogram. Those two mel-spectrograms may have different lengths, so how should the two mel sequences be padded to make the subtraction and get the "pixel-wise" picture? Is there any tool for that?

CookiePPP commented 3 years ago

Those two mel-spectrogram may be have different length

They will be the same length for the example in the paper.

get the 'Pixel-wise' picture?

pixel_wise_diff = torch.nn.functional.l1_loss(spectrogram1, spectrogram2, reduction='none')

Is there any tools for that?

You can use plot_spectrogram to plot the output: https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/utils.py#L10-L19

Then view the plot in TensorBoard, or plot pixel_wise_diff directly in a notebook.
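A minimal sketch of that computation, with random tensors standing in for the two real spectrograms (the shapes are illustrative; HiFi-GAN's mel pipeline produces (n_mel_channels, n_frames) arrays):

```python
import torch

# Hypothetical same-shape stand-ins for the two mel-spectrograms being compared.
spectrogram1 = torch.randn(80, 200)
spectrogram2 = torch.randn(80, 200)

# reduction='none' keeps the full (80, 200) absolute-difference map
# instead of averaging it into a single scalar loss.
pixel_wise_diff = torch.nn.functional.l1_loss(
    spectrogram1, spectrogram2, reduction="none")

# pixel_wise_diff.numpy() can then be passed to the repo's plot_spectrogram,
# or shown directly with imshow in a notebook.
```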

JohnHerry commented 3 years ago


Thank you for the reply. I want to measure the difference between the ground-truth speech and the HiFi-GAN generated audio, but I found that the two waveforms have different sizes, and sizeof(GT audio) - sizeof(gen audio) is not a constant value, so I am not sure how to compare them, even in the mel-spectrogram domain.

CookiePPP commented 3 years ago

https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/train.py#L145

This line would crash if the model outputs a different length. Maybe a rounding problem with the audio you're using?


Look at the training code for a working example; maybe you can figure something out from there. https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/train.py#L122-L124
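If the two waveforms do come out slightly different lengths at inference time, one simple workaround (a sketch, not something the repo does) is to trim both to the shorter length before computing the mel-spectrograms, so the subtraction is sample-for-sample:

```python
import torch

# Hypothetical waveforms: ground truth vs. a vocoder output that came out
# a few hundred samples shorter due to frame rounding.
gt_audio = torch.randn(16000)
gen_audio = torch.randn(15872)

# Trim both to the common length so they can be compared sample-for-sample
# (and frame-for-frame once converted to mel-spectrograms).
n = min(gt_audio.size(0), gen_audio.size(0))
gt_trimmed, gen_trimmed = gt_audio[:n], gen_audio[:n]
```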

JohnHerry commented 3 years ago

My HiFi-GAN models are trained with 16 kHz samples; config.json is unchanged except for "sampling_rate". Audio files generated with inference.py sound good, but the generated files are smaller than the corresponding ground truth.
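A likely source of the size mismatch (an assumption, not confirmed in this thread): the vocoder emits a whole number of hop_size-sample frames, so any trailing partial frame of the ground-truth audio is dropped, and the shortfall varies with the input length. A rough sketch of the arithmetic, assuming the default hop_size of 256 was kept when only "sampling_rate" changed:

```python
hop_size = 256          # HiFi-GAN's default hop size (assumed unchanged here)
n_gt_samples = 16000    # one second of 16 kHz ground-truth audio

# Roughly floor(N / hop_size) mel frames, then hop_size output samples per
# frame, so the trailing partial frame of the input is lost.
n_frames = n_gt_samples // hop_size
n_gen_samples = n_frames * hop_size

shortfall = n_gt_samples - n_gen_samples  # 128 here; varies with input length
```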

JohnHerry commented 3 years ago


Thanks for your help. I get it.