neoncloud / mdctGAN

Code for INTERSPEECH 2023 paper "mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra"

Objective evaluation #3

Open yxlu-0102 opened 11 months ago

yxlu-0102 commented 11 months ago

I synthesised waveforms with your official checkpoint on the test set of VCTK-Corpus-0.92, which contains the audio clips of the last 8 speakers.

I calculated the LSD and SNR scores between the generated and reference audio, but the resulting metrics are not as good as those in your paper.

Additionally, the LSD calculation in util.util.compute_metrics seems strange: n_fft should be 2048, while your default setting is 1024.
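
For readers following the n_fft point, here is a minimal sketch of how LSD and SNR are commonly computed, with n_fft exposed as a parameter. It is not the repository's util.util.compute_metrics: the function names, the Hann window, the hop size of 512 and the eps value are all assumptions made for illustration, and other papers may define LSD slightly differently.

```python
import numpy as np

def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram (frames x bins); Hann window, no padding (assumed settings)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def lsd(ref, gen, n_fft=2048, hop=512, eps=1e-8):
    """Log-spectral distance: RMS over frequency of the log-power difference, averaged over frames."""
    n = min(len(ref), len(gen))
    p_ref = stft_power(ref[:n], n_fft, hop)
    p_gen = stft_power(gen[:n], n_fft, hop)
    diff = np.log10(p_ref + eps) - np.log10(p_gen + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

def snr_db(ref, gen, eps=1e-8):
    """Signal-to-noise ratio in dB, treating (ref - gen) as the noise."""
    n = min(len(ref), len(gen))
    noise = ref[:n] - gen[:n]
    return float(10 * np.log10(np.sum(ref[:n] ** 2) / (np.sum(noise ** 2) + eps)))
```

Because n_fft sets the frequency resolution of the spectrogram, LSD values computed with n_fft=1024 and n_fft=2048 are generally not directly comparable, which is why the setting matters when comparing against other systems.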

neoncloud commented 11 months ago

Thank you for your interest in our work. Could you please elaborate on your reproduction process, including:

How did you calculate the LSD and other metrics? Did you use the method we provided, or another library or software?
You mentioned using the "last 8 speakers", which doesn't seem to match the test set we used. Could you please elaborate on your test set partitioning method?
If possible, could you provide your reproduction results, including file names and scores?

yxlu-0102 commented 11 months ago

I used the metric_calculator you provided, but I changed n_fft to 2048 for a fair comparison with other systems.
The systems you compared with in your paper (e.g., NU-Wave 2 and UDM+) used VCTK-0.92 as the dataset, and their test set contains the last 8 speakers, so I used the same test set for a fair comparison.
For example, for the 24 kHz to 48 kHz experiment, the metrics I calculated are an LSD of 0.72 and an SNR of 25.86. The metrics in your paper are an LSD of 0.61 and an SNR of 26.26.

yd8175618 commented 2 months ago

Hello, does this model support real-time voice super-resolution? For example, could long speech be split into multiple 16 ms segments for processing and then merged at the output end?
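
The split-and-merge idea in this question can be sketched roughly as below. This only illustrates the chunking itself and says nothing about whether the model supports it: process_chunk is a hypothetical placeholder (an identity function here), and the sketch assumes a 48 kHz signal whose chunks keep the same number of samples after processing.

```python
import numpy as np

def split_into_chunks(x, sr=48000, chunk_ms=16):
    """Zero-pad x to a whole number of chunks and reshape to (n_chunks, chunk_len)."""
    chunk_len = sr * chunk_ms // 1000  # 768 samples for 16 ms at 48 kHz
    pad = (-len(x)) % chunk_len
    padded = np.pad(x, (0, pad))
    return padded.reshape(-1, chunk_len), len(x)

def process_chunk(chunk):
    # Hypothetical stand-in for a model call; returns the chunk unchanged.
    return chunk

def stream_process(x, sr=48000, chunk_ms=16):
    """Process each chunk independently, concatenate, and trim the padding."""
    chunks, n = split_into_chunks(x, sr, chunk_ms)
    out = np.concatenate([process_chunk(c) for c in chunks])
    return out[:n]
```

Processing chunks independently and concatenating them can leave audible discontinuities at the boundaries, so a practical version would likely need overlapping chunks with cross-fading. Whether the model accepts segments this short is not stated in this thread.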