neoncloud / mdctGAN

Code for INTERSPEECH 2023 paper "mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra"

Objective evaluation #3

Open yxlu-0102 opened 11 months ago

yxlu-0102 commented 11 months ago

I synthesised waveforms with your official checkpoint on the test set of VCTK-Corpus-0.92, which contains the audio clips of the last 8 speakers.

I calculated the LSD and SNR scores between the generated and reference audio on the test set, but the results are worse than those reported in your paper.

Additionally, the LSD calculation in util.util.compute_metrics seems strange: n_fft should be 2048, but your default setting is 1024.

neoncloud commented 11 months ago

Thank you for your interest in our work. Please could you elaborate on your reproduction process, including...

yxlu-0102 commented 11 months ago
  1. I used the metric_calculator you provided, but I changed n_fft to 2048 for a fair comparison with other systems.

  2. The systems you compared against in your paper (e.g., NU-wave2 and UDM+) used VCTK-0.92 as the dataset, and their test set contains the last 8 speakers, so I used the same test set for a fair comparison.

  3. For example, for the 24 kHz to 48 kHz experiment, I measured an LSD of 0.72 and an SNR of 25.86, whereas the paper reports an LSD of 0.61 and an SNR of 26.26.
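
For reference, the two metrics discussed above can be sketched as follows. This is a minimal NumPy sketch of the log-spectral distance as it is commonly defined in speech super-resolution work, plus a plain waveform SNR. It is not the repository's util.util.compute_metrics: the function names (stft_power, lsd, snr), the hop size of 512, the unpadded Hann framing, and the eps value are all assumptions made for illustration.

```python
import numpy as np


def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram from Hann-windowed frames (no centre padding)."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < n_fft:
        # Zero-pad very short signals so at least one frame exists.
        x = np.pad(x, (0, n_fft - len(x)))
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [x[i * hop: i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Shape: (n_frames, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def lsd(ref, est, n_fft=2048, hop=512, eps=1e-10):
    """Log-spectral distance: RMS over frequency, then mean over frames."""
    n = min(len(ref), len(est))  # compare only the overlapping samples
    p_ref = stft_power(ref[:n], n_fft, hop)
    p_est = stft_power(est[:n], n_fft, hop)
    diff = np.log10(p_ref + eps) - np.log10(p_est + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))


def snr(ref, est, eps=1e-10):
    """Waveform signal-to-noise ratio in dB."""
    n = min(len(ref), len(est))
    ref = np.asarray(ref[:n], dtype=np.float64)
    est = np.asarray(est[:n], dtype=np.float64)
    noise = ref - est
    return float(10 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(noise ** 2) + eps)))
```

Since LSD is averaged over the frequency bins of each frame, the score can change with n_fft (for example 1024 versus 2048) and with the hop size, so values are directly comparable only when every system is scored with the same STFT settings.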

yd8175618 commented 2 months ago

Hello, does this model support real-time voice super-resolution? For example, could long speech be split into multiple 16 ms chunks for processing, with the outputs merged at the output end?
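
The split-and-merge scheme asked about here can be sketched as below. This only illustrates the chunking logic. Whether mdctGAN produces acceptable output on 16 ms chunks is not answered in this thread. The names split_into_chunks and process_stream are hypothetical, and repeat_upsample is a trivial placeholder standing in for a model call.

```python
def split_into_chunks(samples, sample_rate, chunk_ms=16):
    """Cut a sample sequence into fixed-length chunks, zero-padding the last."""
    chunk_len = sample_rate * chunk_ms // 1000  # e.g. 384 samples at 24 kHz
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        chunk += [0.0] * (chunk_len - len(chunk))  # pad the final short chunk
        chunks.append(chunk)
    return chunks


def process_stream(samples, sample_rate, upsample_fn, chunk_ms=16, factor=2):
    """Run upsample_fn on each chunk and concatenate the results."""
    out = []
    for chunk in split_into_chunks(samples, sample_rate, chunk_ms):
        out.extend(upsample_fn(chunk))
    # Trim the part of the output that came from zero padding.
    return out[:len(samples) * factor]


def repeat_upsample(chunk):
    """Placeholder for a model call (assumption): repeats each sample twice."""
    return [s for s in chunk for _ in range(2)]


if __name__ == "__main__":
    audio = [float(i) for i in range(1000)]  # stand-in for 24 kHz input
    result = process_stream(audio, 24000, repeat_upsample)
    print(len(result))
```

Plain concatenation like this has no overlap between chunks, so a real model may produce audible discontinuities at chunk boundaries. Overlapping chunks with cross-fading is a common remedy, at the cost of extra latency.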