Unable to reproduce results

charliezjw commented 1 year ago

Hi,

This is fantastic work, and the samples on the demo page are excellent.

But when I try to reproduce some of the results, the output sounds metallic. I am using the checkpoint you released, "DaftExprt_LJ_ESD_22kHz", and I built the Docker environment exactly as you provided.

I did not change any code, but the TTS output sounds very metallic, as in the samples below (I converted them to .mp4 so that GitHub would accept them):

https://github.com/ubisoft/ubisoft-laforge-daft-exprt/assets/3964282/026974b2-886c-43d2-80e0-7f2e52f0b7b5

https://github.com/ubisoft/ubisoft-laforge-daft-exprt/assets/3964282/80e05d23-20b1-4ba5-96b9-1b98c81fcf02

https://github.com/ubisoft/ubisoft-laforge-daft-exprt/assets/3964282/ec189b4d-7821-4162-9d57-b54cf3426799

https://github.com/ubisoft/ubisoft-laforge-daft-exprt/assets/3964282/2814cead-0c32-478b-8766-0ed96a2cd9c9

https://github.com/ubisoft/ubisoft-laforge-daft-exprt/assets/3964282/5afd9dc3-acd2-4391-b9a9-ef1c4fb95a16

I am also attaching the original .wav files here just in case: Daft_debug_samples.zip

Did I do something wrong, or miss a step?

Thank you very much! Charlie

julianzaidi commented 1 year ago

Hi Charlie,

You did nothing wrong. The generation script provided in the repository uses Griffin-Lim to convert the generated mel-spectrogram to an audio waveform, and it is Griffin-Lim that produces this metallic effect. For better audio fidelity, you can replace Griffin-Lim with a neural vocoder. Many pre-trained vocoders already exist; in our paper we use a pre-trained HiFi-GAN. As an additional step, you could also fine-tune the pre-trained vocoder on Daft-Exprt predictions. Please refer to the README of the repository for more information.
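For readers wondering why Griffin-Lim sounds metallic, here is a minimal NumPy sketch of the algorithm. It is not the repository's generation code. The function names, `n_fft`, `hop`, and the test tone are assumptions chosen for illustration, and the sketch works on a linear magnitude spectrogram, whereas Daft-Exprt predicts a mel-spectrogram that would first have to be mapped back to linear frequency. Griffin-Lim only has magnitudes to work with, so it starts from random phase and repeatedly re-estimates it. The phase it settles on is only approximately consistent, and that residual inconsistency is commonly heard as a metallic or robotic timbre.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=512, hop=128):
    """Windowed overlap-add inverse of stft()."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(frames) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    # Floor the normaliser so the tapered edge samples are attenuated
    # instead of being amplified by a near-zero divisor.
    return out / np.maximum(norm, 0.1)

def griffin_lim(magnitude, n_iter=32, n_fft=512, hop=128, seed=0):
    """Estimate a waveform from magnitudes only by iterating on the phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(n_iter):
        audio = istft(magnitude * phase, n_fft, hop)
        # Keep the target magnitude, adopt the phase of the re-analysed audio.
        phase = np.exp(1j * np.angle(stft(audio, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)

def spectral_error(magnitude, audio, n_fft=512, hop=128):
    """Relative mismatch between the target and the re-analysed magnitudes."""
    rebuilt = np.abs(stft(audio, n_fft, hop))
    return np.linalg.norm(rebuilt - magnitude) / np.linalg.norm(magnitude)

# Hypothetical test tone at 22.05 kHz; its length matches a whole number of hops.
sr = 22050
t = np.arange(512 + 128 * 60) / sr
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
magnitude = np.abs(stft(signal))
audio = griffin_lim(magnitude)
```

Comparing `spectral_error(magnitude, audio)` with the value obtained for `n_iter=0` shows how much the iterations help. A neural vocoder such as HiFi-GAN avoids this phase-estimation step by learning to generate the waveform directly from the mel-spectrogram.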

julianzaidi commented 1 year ago

Closing due to inactivity

Unable to reproduce results #18