Closed JohnHerry closed 3 years ago
Those two mel-spectrograms may have different lengths
They will be the same length for the example in the paper.
get the 'Pixel-wise' picture?
pixel_wise_diff = torch.nn.functional.l1_loss(spectrogram1, spectrogram2, reduction='none')
Are there any tools for that?
You can use plot_spectrogram to plot the output
https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/utils.py#L10-L19
and TensorBoard to view the plot, or plot the pixel_wise_diff in a notebook.
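For what it's worth, here is a minimal sketch (mine, not from the repo) of computing and viewing the pixel-wise difference with matplotlib, assuming both spectrograms are torch tensors of the same shape [n_mels, frames]:

```python
import torch
import matplotlib.pyplot as plt

# placeholders: in practice these are the two mels you want to compare,
# e.g. the GT mel and the mel of the generated waveform
spectrogram1 = torch.randn(80, 200)
spectrogram2 = torch.randn(80, 200)

# element-wise |a - b|; reduction='none' keeps the full 2-D map
pixel_wise_diff = torch.nn.functional.l1_loss(spectrogram1, spectrogram2, reduction='none')

# view the difference map
fig, ax = plt.subplots()
im = ax.imshow(pixel_wise_diff.numpy(), aspect='auto', origin='lower')
fig.colorbar(im, ax=ax)
ax.set_xlabel('frames')
ax.set_ylabel('mel bins')
plt.show()
```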
Thank you for the reply. I want to measure the difference between the ground-truth speech and the HiFi-GAN generated audio, but I found that the two waveforms have different sizes, and "sizeof(GT audio) - sizeof(gen audio)" is not a constant value, so I am not sure how to compare them, even in the mel-spectrogram domain.
https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/train.py#L145
This line would crash if the model outputs a different length. Maybe a rounding problem with the audio you're using?
Look at the training code for a working example; maybe you can figure something out from there. https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/train.py#L122-L124
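If it helps, here is a rough sketch (mine, not from the repo) of one way to line the two signals up: trim both waveforms to the shorter length, then compute the mels with the repo's mel_spectrogram from meldataset.py. The file names are placeholders, and the mel parameters below are the LJSpeech defaults, so substitute the values from your own config.json:

```python
import torch
from scipy.io.wavfile import read
from meldataset import mel_spectrogram, MAX_WAV_VALUE

# placeholder paths for the ground-truth and generated wavs
sr, gt = read('gt.wav')
_, gen = read('generated.wav')

gt = torch.FloatTensor(gt / MAX_WAV_VALUE).unsqueeze(0)
gen = torch.FloatTensor(gen / MAX_WAV_VALUE).unsqueeze(0)

# trim both waveforms to the shorter length so the mels line up
n = min(gt.size(1), gen.size(1))
gt, gen = gt[:, :n], gen[:, :n]

# LJSpeech defaults (n_fft, num_mels, sampling_rate, hop_size,
# win_size, fmin, fmax); replace with your config's values
mel_gt = mel_spectrogram(gt, 1024, 80, 22050, 256, 1024, 0, 8000)
mel_gen = mel_spectrogram(gen, 1024, 80, 22050, 256, 1024, 0, 8000)

pixel_wise_diff = torch.nn.functional.l1_loss(mel_gt, mel_gen, reduction='none')
```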
My HiFi-GAN models are trained on 16 kHz samples; in config.json I changed nothing except the "sampling_rate". The audio files generated with inference.py sound good, but the generated files are smaller than the corresponding ground truth.
Thanks for your help. I get it.
In the HiFi-GAN paper, Figure 3 shows the difference between the mel-spectrogram of the generated waveform and the Tacotron 2 generated mel. Those two mel-spectrograms may have different lengths, so how do I pad the two mel sequences to do the subtraction and get the 'pixel-wise' picture? Are there any tools for that?