v-iashin / SpecVQGAN

Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
https://v-iashin.github.io/SpecVQGAN
MIT License
347 stars, 40 forks

about training vocoder #15

Closed (yangdongchao closed this issue 2 years ago)

yangdongchao commented 2 years ago

Hi, I have a question about training MelGAN. I noticed that when you train MelGAN, you normalize the audio data before converting it to a mel spectrogram, e.g. in the file vocoder/mel2wav/dataset.py:

    def load_wav_to_torch(self, full_path):
        data = np.load(full_path)
        data = 0.95 * normalize(data)

I would like to know why you normalize the audio and then multiply it by 0.95. After the normalization, is the extracted mel spectrogram the same as the original one? In other words, does this operation influence the results when the vocoder is used to convert a predicted spectrogram into a waveform?

Furthermore, when I use your script vocoder/scripts/generate_from_folder.py to generate samples, it fails (that is, the reconstructed audio is far from the original audio). After I modified it as follows, it works:

    def main():
        args = parse_args()
        vocoder = MelVocoder(args.load_path)

        args.save_path.mkdir(exist_ok=True, parents=True)

        for i, fname in tqdm(enumerate(args.folder.glob("*.wav"))):
            wavname = fname.name
            wav, sr = librosa.core.load(fname)
            # NOTE: as posted, `data` is computed here, but `wav` is what is passed to wav2mel below
            data = 0.95 * normalize(wav)
            # wav = torch.from_numpy(wav).unsqueeze(0)
            # mel = vocoder(torch.from_numpy(wav)[None])
            mel = wav2mel(wav)
            # print('mel ', mel.shape)
            # assert 1 == 2
            recons = vocoder.inverse(mel).squeeze().cpu().numpy()

            librosa.output.write_wav(args.save_path / wavname, recons, sr=sr)

v-iashin commented 2 years ago

Hi, thanks for your issue! However, I think these questions should be addressed to the authors of MelGAN.

> Why do you normalize it and then multiply by 0.95?

This is a good question. Since we use the original MelGAN implementation, I think your question should be addressed to the authors of MelGAN. I am not sure why they decided to do it.

https://github.com/descriptinc/melgan-neurips/blob/6488045bfba1975602288de07a58570c7b4d66ea/mel2wav/dataset.py#L64

and it seems you are not the first one who wonders about it: https://github.com/descriptinc/melgan-neurips/issues/36
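For context, here is a minimal sketch of what the operation does numerically. It does not explain the MelGAN authors' intent. It assumes that `normalize` is peak normalization (scaling so that the largest absolute sample is 1.0), and it uses a single-frame log-magnitude spectrum as a stand-in for a log-mel frame; `peak_normalize` and `log_magnitude` are hypothetical helper names, not functions from this repository. Under those assumptions the waveform peaks at 0.95, and the log spectrum is the original one shifted by a constant, so it is not identical to the spectrum of the raw audio.

```python
import numpy as np


def peak_normalize(x: np.ndarray) -> np.ndarray:
    """Scale x so its largest absolute sample is 1.0 (assumed behaviour of `normalize`)."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x


def log_magnitude(x: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Log10 magnitude spectrum of one windowed frame (stand-in for a log-mel frame)."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return np.log10(np.maximum(spectrum, eps))


rng = np.random.default_rng(0)
wav = 0.3 * rng.standard_normal(1024)   # synthetic audio frame, for illustration only
scaled = 0.95 * peak_normalize(wav)     # same operation as in load_wav_to_torch

# Overall gain applied to the waveform by the two steps combined.
gain = 0.95 / np.max(np.abs(wav))

# Scaling the waveform by `gain` shifts every log-magnitude bin by log10(gain).
offset = log_magnitude(scaled) - log_magnitude(wav)

print("peak after scaling:", np.max(np.abs(scaled)))
print("offset range:", offset.min(), offset.max(), "log10(gain):", np.log10(gain))
```

If this assumption holds, a vocoder trained on audio preprocessed this way sees spectrograms at that scaled level, so applying the same preprocessing before extracting spectrograms at inference time keeps the inputs consistent with training.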

> I use your script vocoder/scripts/generate_from_folder.py to generate samples

I am not sure where this script is needed, because I don't see it used anywhere in this repository. Again, you would need to ask the authors of MelGAN. Sorry for the confusion. I will remove the unnecessary code from this repository.

v-iashin commented 2 years ago

Also, check this piece of code if you wonder how to reconstruct predictions of the MelGAN generator: https://github.com/v-iashin/SpecVQGAN/blob/389445808a6a8301b888fe55e2a5d27b5593cefd/vocoder/scripts/train.py#L194-L202

yangdongchao commented 2 years ago

Thank you very much.
