jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

Some questions #4

Open george-roussos opened 3 years ago

george-roussos commented 3 years ago

Hi, thanks for sharing the code, it is well appreciated. Some questions:

  • Do you train with mean-var normalization? If not, what is the range normalization?
  • I tried to plug in the models using a spectrogram generated by Mozilla TTS, but had no luck (a waveform is generated, but the sound is very distorted). Do you have any idea why this happens? Is there any difference in the way the spectrograms are computed on HiFi-GAN's side? The training attributes (win, hop, fmin, fmax) are otherwise the same.
  • When finetuning for TTS, how do you acquire your ground truth mels? Using the TTS model you want to use the GAN with?
  • How many steps do you train for?

Thanks again, these results are impressive for a GAN.

Edresson commented 3 years ago

Hi @george-roussos, yes, Mozilla TTS has some extra normalizations. I believe we need to adapt HiFi-GAN to the spectrograms of Mozilla TTS. I plan to do that soon.

As for the other questions, I am also curious to know the answers.

george-roussos commented 3 years ago

That's great 😀 I am trying to adjust meldataset.py right now, using your WaveGrad implementation as a pointer; I hope I will be able to do it.

Edresson commented 3 years ago

I believe it is not necessary. Just do the fine-tuning on Mozilla TTS. Basically, make a script that synthesizes all the sentences of the LJSpeech dataset (or any other) and saves the generated spectrograms with numpy (.npy). Then pass to the HiFi-GAN training script (train.py):

--fine_tuning True and --input_mels_dir Mozilla_TTS_generated_Specs

This way, the dataloader will not extract mel spectrograms itself but will use those already extracted from the TTS model.

I believe this way is faster and needs less training (I may be wrong). If you choose to extract the specs, train HiFi-GAN, and fine-tune afterwards, I believe it will take much longer (two trainings). And given that the spectrograms are almost the same and Mozilla TTS only has a few more normalizations, fine-tuning should be able to get around the difference without any problems.
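
To make this concrete, a minimal sketch of the spectrogram-dumping step. synthesize_mel and the metadata iterable are hypothetical placeholders for whatever your TTS repo provides; only the two train.py flags come from this thread:

    import os

    import numpy as np

    out_dir = 'Mozilla_TTS_generated_Specs'
    os.makedirs(out_dir, exist_ok=True)

    # metadata: (utterance_id, sentence) pairs for the dataset you fine-tune on.
    # synthesize_mel() is a hypothetical stand-in for however your TTS model
    # produces its mel output (e.g. teacher-forced synthesis in Mozilla TTS).
    for utt_id, sentence in metadata:
        mel = synthesize_mel(sentence)                      # shape: (num_mels, frames)
        np.save(os.path.join(out_dir, utt_id + '.npy'), mel)

and then something like:

    python train.py --fine_tuning True --input_mels_dir Mozilla_TTS_generated_Specs

As far as I can tell, the fine-tuning dataloader matches each .npy to a wav by basename, so name the files after the corresponding wavs in the training file list.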

george-roussos commented 3 years ago

I thought that too. And you can extract the spectrograms using ExtractTTSpectrogram, so I already have them. I will try that too.

Edresson commented 3 years ago

Do you intend to train a universal model?

george-roussos commented 3 years ago

Yes, on LibriTTS (or, preferably, another multispeaker set with more universal sound quality). But it'd have to be after my single-speaker tests, so I can first see how it performs.

george-roussos commented 3 years ago

I tried to plug in the Mozilla TTS AudioProcessor by adding it as a module, instantiating the class, and changing the __getitem__ function:

    def __getitem__(self, index):
        filename = self.audio_files[index]
        if self._cache_ref_count == 0:
            audio, sampling_rate = T.load_wav(filename)
            if not self.fine_tuning:
                audio = torch.clamp(audio[0] / 32767.5, -1.0, 1.0)
            self.cached_wav = audio
            if sampling_rate != self.sampling_rate:
                raise ValueError("{} SR doesn't match target {} SR".format(
                    sampling_rate, self.sampling_rate))
            self._cache_ref_count = self.n_cache_reuse
        else:
            audio = self.cached_wav
            self._cache_ref_count -= 1

#        audio = torch.FloatTensor(audio)
#        audio = audio.unsqueeze(0)

        if not self.fine_tuning:
            if self.split:
                if audio.size(1) >= self.segment_size:
                    max_audio_start = audio.size(1) - self.segment_size
                    audio_start = random.randint(0, max_audio_start)
                    audio = audio[:, audio_start:audio_start+self.segment_size]
                else:
                    audio = torch.nn.functional.pad(audio, (0, self.segment_size - audio.size(1)), 'constant')

            #mel = mel_spectrogram(audio, self.n_fft, self.num_mels,
            #                      self.sampling_rate, self.hop_size, self.win_size, self.fmin, self.fmax,
            #                      center=False)

            mel = np.float32(ap.melspectrogram(audio.detach().cpu().numpy()))

but I got IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1).
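
For reference, that IndexError is what PyTorch raises whenever a dimension-1 operation (such as audio.size(1) in the split branch above) hits a 1-D tensor, which is likely what happens here once audio[0] is taken and the unsqueeze(0) above stays commented out. A two-line reproduction:

    import torch

    audio = torch.zeros(8192)   # 1-D, e.g. after indexing audio[0] with no later unsqueeze(0)
    audio.size(1)               # IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)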

jik876 commented 3 years ago

Hi, thanks for sharing the code, it is well appreciated. Some questions:

  • Do you train with mean-var normalization? If not, what is the range normalization?
  • I tried to plug in the models using a spectrogram generated by Mozilla TTS, but had no luck (a waveform is generated, but the sound is very distorted). Do you have any idea why this happens? Is there any difference in the way the spectrograms are computed on HiFi-GAN's side? The training attributes (win, hop, fmin, fmax) are otherwise the same.
  • When finetuning for TTS, how do you acquire your ground truth mels? Using the TTS model you want to use the GAN with?
  • How many steps do you train for?

Thanks again, these results are impressive for a GAN.

Thanks for your interest. Please understand that the reply is late due to our other work.

  • We didn't use any additional normalization after spectrogram generation in preprocessing except for clipping.
  • We used widely used libraries for spectrogram generation, and I don't think there is any special computation. I can't pinpoint it because I haven't tried it, but I think the difference in normalization and frequency range can definitely lead to quality degradation.
  • We used ground-truth mel-spectrogram from ground-truth audio. We used NVIDIA Tacotron2 with teacher-forcing to generate mel-spectrogram as input condition for training.
  • We trained the model up to 2,500k steps.
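
For anyone wanting to reproduce that teacher-forcing setup, here is a rough sketch of dumping teacher-forced mels to .npy with the NVIDIA Tacotron 2 repo. create_hparams, load_model and parse_batch are that repo's helpers (verify against your local copy); the checkpoint path and the loader/utt_ids plumbing are placeholders:

    import os

    import numpy as np
    import torch

    from hparams import create_hparams   # NVIDIA Tacotron 2 repo
    from train import load_model         # NVIDIA Tacotron 2 repo (not HiFi-GAN's train.py)

    hparams = create_hparams()
    model = load_model(hparams)
    model.load_state_dict(torch.load('tacotron2_statedict.pt')['state_dict'])
    model.eval()

    os.makedirs('teacher_forced_mels', exist_ok=True)
    with torch.no_grad():
        for utt_ids, batch in loader:            # loader / utt_ids: your own data plumbing
            x, _ = model.parse_batch(batch)
            _, mel_postnet, _, _ = model(x)      # the forward pass is teacher-forced
            for utt_id, mel in zip(utt_ids, mel_postnet):
                # a real script would also trim each mel to its true length before saving
                np.save(os.path.join('teacher_forced_mels', utt_id + '.npy'), mel.cpu().numpy())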

CookiePPP commented 3 years ago

@jik876 Did you notice any overfitting with your largest/best models? Any areas for possible improvement to audio quality i.e sampling rate, speaker embeddings, noise embeddings?

The speakers I'm targeting are slightly noisy and I'm curious what you'd recommend; something like training on 48Khz VCTK + Blizzard2011 dataset and performing inference on the slightly noisy unseen speaker predicted spectrograms? or maybe training on the target speakers and using a speaker or noise embedding which at inference time can be set to "clean" outputs?

Your audio samples already sound great, I'm just curious what the upper limit is now. 😄

george-roussos commented 3 years ago

Hi, thanks for sharing the code, it is well appreciated. Some questions:

  • Do you train with mean-var normalization? If not, what is the range normalization?
  • I tried to plug in the models using a spectrogram generated by Mozilla TTS, but had no luck (a waveform is generated, but the sound is very distorted). Do you have any idea why this happens? Is there any difference in the way the spectrograms are computed on HiFi-GAN's side? The training attributes (win, hop, fmin, fmax) are otherwise the same.
  • When finetuning for TTS, how do you acquire your ground truth mels? Using the TTS model you want to use the GAN with?
  • How many steps do you train for?

Thanks again, these results are impressive for a GAN.

Thanks for your interest. Please understand that the reply is late due to our other work.

  • We didn't use any additional normalization after spectrogram generation in preprocessing except for clipping.
  • We used widely used libraries for spectrogram generation, and I don't think there is any special computation. I can't pinpoint it because I haven't tried it, but I think the difference in normalization and frequency range can definitely lead to quality degradation.
  • We used ground-truth mel-spectrogram from ground-truth audio. We used NVIDIA Tacotron2 with teacher-forcing to generate mel-spectrogram as input condition for training.
  • We trained the model up to 2,500k steps.

Hi, thanks a lot for the reply 😀 @Edresson and I were able to adjust it for Mozilla TTS and it is now training. Can I ask if you meant 2 and a half million steps at the end?

jik876 commented 3 years ago

@jik876 Did you notice any overfitting with your largest/best models? Any areas for possible improvement to audio quality i.e sampling rate, speaker embeddings, noise embeddings?

The speakers I'm targeting are slightly noisy and I'm curious what you'd recommend; something like training on 48Khz VCTK + Blizzard2011 dataset and performing inference on the slightly noisy unseen speaker predicted spectrograms? or maybe training on the target speakers and using a speaker or noise embedding which at inference time can be set to "clean" outputs?

Your audio samples already sound great, I'm just curious what the upper limit is now. 😄

@CookiePPP

Thanks for your interest. 😄

We haven't seen overfitting in our experiments. It is hard to comment exactly because we have not experimented with a higher sample rate or with embeddings. As a slightly different method: if you use a noisy spectrogram as the input condition and clean audio as the ground truth for training, I think HiFi-GAN can synthesize clean audio from a noisy spectrogram. This is in line with the fine-tuning experiment in our paper. I'm not sure this answer is what you want because I don't know exactly what dataset you are using, but I hope you have good results.
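
A sketch of how such (noisy input, clean target) pairs could be pushed through the existing fine-tuning path: compute mels from a noised copy of each wav but keep the clean wav as the target, saving each mel under the clean file's basename (as far as I can tell, that is how the fine-tuning dataloader pairs mels with audio). The Gaussian noise and the mel parameters below are illustrative stand-ins; match them to your data and config:

    import glob
    import os

    import numpy as np
    import torch

    from meldataset import mel_spectrogram, load_wav, MAX_WAV_VALUE  # helpers from this repo

    wav_dir, mel_dir = 'LJSpeech-1.1/wavs', 'noisy_mels'   # paths are illustrative
    os.makedirs(mel_dir, exist_ok=True)

    for path in glob.glob(os.path.join(wav_dir, '*.wav')):
        audio, sr = load_wav(path)
        audio = torch.FloatTensor(audio / MAX_WAV_VALUE).unsqueeze(0)
        noisy = torch.clamp(audio + 0.003 * torch.randn_like(audio), -1.0, 1.0)  # stand-in for real noise
        mel = mel_spectrogram(noisy, 1024, 80, sr, 256, 1024, 0, 8000, center=False)
        # same basename as the clean wav, so (noisy mel, clean audio) become one training pair
        np.save(os.path.join(mel_dir, os.path.splitext(os.path.basename(path))[0] + '.npy'),
                mel.squeeze(0).numpy())

Training would then be started with --fine_tuning True and --input_mels_dir noisy_mels while the wav list still points at the clean audio.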

jik876 commented 3 years ago

Hi, thanks for sharing the code, it is well appreciated. Some questions:

  • Do you train with mean-var normalization? If not, what is the range normalization?
  • I tried to plug in the models using a spectrogram generated by Mozilla TTS, but had no luck (a waveform is generated, but the sound is very distorted). Do you have any idea why this happens? Is there any difference in the way the spectrograms are computed on HiFi-GAN's side? The training attributes (win, hop, fmin, fmax) are otherwise the same.
  • When finetuning for TTS, how do you acquire your ground truth mels? Using the TTS model you want to use the GAN with?
  • How many steps do you train for?

Thanks again, these results are impressive for a GAN.

Thanks for your interest. Please understand that the reply is late due to our other work.

  • We didn't use any additional normalization after spectrogram generation in preprocessing except for clipping.
  • We used widely used libraries for spectrogram generation, and I don't think there is any special computation. I can't pinpoint it because I haven't tried it, but I think the difference in normalization and frequency range can definitely lead to quality degradation.
  • We used ground-truth mel-spectrogram from ground-truth audio. We used NVIDIA Tacotron2 with teacher-forcing to generate mel-spectrogram as input condition for training.
  • We trained the model up to 2,500k steps.

Hi, thanks a lot for the reply 😀 @Edresson and I were able to adjust it for Mozilla TTS and it is now training. Can I ask if you meant 2 and a half million steps at the end?

@george-roussos

Yes. I meant 2,500,000 steps. However, it synthesizes high-quality audio even at an earlier step. Therefore, it is advisable to adjust the training steps as needed. I hope you have good results from the work.

george-roussos commented 3 years ago

Yes. I meant 2,500,000 steps. However, it synthesizes high-quality audio even at an earlier step. Therefore, it is advisable to adjust the training steps as needed. I hope you have good results from the work.

Thanks a lot. Actually, with your implementation I am getting the best results I have ever gotten with a GAN. Something I observed is that HiFi-GAN handles lower frequencies much better when the speaker glottalises; with all other implementations I had problems and it sounded bad, but with HiFi-GAN it sounds much more natural. Something else I observed is that breathing still sounds metallic when vocoding a spectrogram constructed by Taco2. However, I have not tried fine-tuning yet, so I cannot conclude whether it helps with this or not. Did you also notice this?

CookiePPP commented 3 years ago

@jik876

As a slightly different method: if you use a noisy spectrogram as the input condition and clean audio as the ground truth for training, I think HiFi-GAN can synthesize clean audio from a noisy spectrogram.

Yes, that's what I was thinking. 😄 I'm doing 44Khz testing as first priority, and I'll test this noisy input spectrogram on a Colab session in parallel later.

jik876 commented 3 years ago

Yes. I meant 2,500,000 steps. However, it synthesizes high-quality audio even at an earlier step. Therefore, it is advisable to adjust the training steps as needed. I hope you have good results from the work.

Thanks a lot. Actually, with your implementation I am getting the best results I have ever gotten with a GAN. Something I observed is that HiFi-GAN handles lower frequencies much better when the speaker glottalises; with all other implementations I had problems and it sounded bad, but with HiFi-GAN it sounds much more natural. Something else I observed is that breathing still sounds metallic when vocoding a spectrogram constructed by Taco2. However, I have not tried fine-tuning yet, so I cannot conclude whether it helps with this or not. Did you also notice this?

That's good news that you're getting the best results. 😄 I'm not sure exactly where the metallic sounds you mention appear, but before fine-tuning there are several quality-degradation points, including metallic sounds. Fine-tuning improved them significantly in our experiments, so it should be helpful here as well. I think you've already done so, but it would be nice to refer to our demo page to check whether the metallic sounds you mention disappear after fine-tuning. Thanks.

george-roussos commented 3 years ago

They do, yes 😀 LJSpeech TTS sounds much more natural after finetuning, so I am holding out hope.

Is there any intuition for training to 2.5M steps or was it because of LJSpeech's sound quality? Did you notice quality improvements after 1M steps?

jik876 commented 3 years ago

They do, yes 😀 LJSpeech TTS sounds much more natural after finetuning, so I am holding out hope.

Is there any intuition for training to 2.5M steps or was it because of LJSpeech's sound quality? Did you notice quality improvements after 1M steps?

@george-roussos

We chose 2.5M steps for a fair comparison with another GAN model. Quality improvement was observed after 1M steps, with varying degrees of improvement across the different datasets. When using the LJSpeech dataset, we observed that the quality improved very slightly as training progressed even after 2.5M steps.

jik876 commented 3 years ago

@george-roussos

We've observed that setting fmax to unlimited value improves the quality in the experiment using the LJ Speech dataset. This part is expected to be related to high frequency. If possible, I think it would be nice to experiment with a higher fmax.

CookiePPP commented 3 years ago

@jik876 I used 11025 on the 44Khz vocoder. Do you think any higher would be worthwhile? (I'm not sure myself)

george-roussos commented 3 years ago

@george-roussos

We've observed that setting fmax to unlimited value improves the quality in the experiment using the LJ Speech dataset. This part is expected to be related to high frequency. If possible, I think it would be nice to experiment with a higher fmax.

Thank you very much for the update! In what way did you notice improvements?

The problem I have is that in TTS synthesis the breathing sounds metallic (even after fine-tuning). The voice itself sounds fine, but the breathing doesn't sound good. It is probably because of the TTS itself; however, it is something I have noticed in all MelGAN variants and something that does not happen with other vocoders (ParallelWaveGAN, WaveGrad), which fix it. Have you also noticed this?

CookiePPP commented 3 years ago

@george-roussos Audio samples? (both from the original dataset, and regenerated using inference.py)

george-roussos commented 3 years ago

I cannot share samples, the speaker has not given me consent. It sounds good during eval (when training and finetuning with ground truth extracted using TTS). I do not notice it there, breathing sounds very clear. Regenerating original samples with the finetuned model also sounds good. It only happens when I try to vocode a TTS synthesized spectrogram.

jik876 commented 3 years ago

@jik876 I used 11025 on the 44Khz vocoder. Do you think any higher would be worthwhile? (I'm not sure myself)

@CookiePPP

In my opinion it depends on the dataset, but it would be nice to experiment with a higher fmax. Because an fmax lower than the maximum frequency gives a partially obscured input condition, I think it may make the problem more difficult for the model to learn.
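
A quick way to see what an fmax cap does to the mel filterbank, using the same librosa helper that HiFi-GAN's meldataset.py imports (the parameter values below are a typical 22.05 kHz setup, not necessarily your config). The filters simply stop at fmax, so every FFT bin above it is invisible to the model:

    import numpy as np
    from librosa.filters import mel as librosa_mel_fn   # the helper meldataset.py also uses

    mel_8k   = librosa_mel_fn(sr=22050, n_fft=1024, n_mels=80, fmin=0, fmax=8000)
    mel_full = librosa_mel_fn(sr=22050, n_fft=1024, n_mels=80, fmin=0, fmax=None)  # None -> sr / 2

    # FFT bins that contribute to no mel filter at all, i.e. frequencies the model never sees
    print((mel_8k.sum(axis=0) == 0).sum(), 'of', mel_8k.shape[1], 'bins unused with fmax=8000')
    print((mel_full.sum(axis=0) == 0).sum(), 'of', mel_full.shape[1], 'bins unused with fmax=None')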

CookiePPP commented 3 years ago

@george-roussos How much data do you have then? I've got the universal 44Khz HiFi-GAN trained on 8x V100s with a massive set of speakers, with help from a couple of friends. It's not finished training and it's probably going to be used on my friend's website anyway, so I don't have permission to share, but I'm curious how it'd go with your dataset on this early checkpoint I have. This early checkpoint sounds perfect on almost every speaker we've tested, like literally can't tell from ground truth audio. And since the spectrogram fmax is 11025, we've tried upsampling 22khz files and it makes the generated output sound better than the input! 😄

george-roussos commented 3 years ago

Sounds good 😀 my speaker has approximately 28 hours of audio and they are very breathy, so it should not be a problem of not enough occurrences. I use an fmin of 0 and an fmax of 8000. I'd love to try 11025, but that would mean I would have to retrain my TTS 😔 In what way did you notice improvements in quality?

jik876 commented 3 years ago

@george-roussos We've observed that setting fmax to unlimited value improves the quality in the experiment using the LJ Speech dataset. This part is expected to be related to high frequency. If possible, I think it would be nice to experiment with a higher fmax.

Thank you very much for the update! In what way did you notice improvements?

The problem I have is that in TTS synthesis the breathing sounds metallic (even after fine-tuning). The voice itself sounds fine, but the breathing doesn't sound good. It is probably because of the TTS itself; however, it is something I have noticed in all MelGAN variants and something that does not happen with other vocoders (ParallelWaveGAN, WaveGrad), which fix it. Have you also noticed this?

@george-roussos

I felt that the sound in the high-frequency range became more natural. In our experiments with the LJ Speech dataset, we did not notice any problem with breathing sounds beyond the characteristics of the original dataset. If you are training both Tacotron 2 and HiFi-GAN, it would be nice to try various settings according to the dataset.

CookiePPP commented 3 years ago

@george-roussos

In what way did you notice improvements in quality?

Between what and what?

so it should not be a problem of not enough occurrences.

Yeah, I trained another 44Khz model on my local system with just the ~150 speakers that I have already set up, and... it's hard to describe, but the vocoded audio files are shifted to match the recordings of those 150 speakers. The audio also has a little more fuzz; though it's almost unnoticeable, it's definitely there. I don't really have any fixes for you if fine-tuning was unsuccessful. I guess I'll just say good luck, and report back anything you find.

george-roussos commented 3 years ago

@jik876 I see your point. The speaker in LJSpeech has a lot of high-frequency content. I do not think my problem is HiFi-GAN related; I am pretty confident in the quality it has shown as a vocoder, so I think it comes from Tacotron2. However, I do think it's exacerbated by the MelGAN foundation, because vocoders like WaveGrad do not exhibit it.

@CookiePPP I will report back if I get it fixed. I actually thought I would train my TTS with r=1 (I stopped at r=2), so that may help. And then switch to the BN prenet instead of the original prenet. I agree about the fuzziness 😀 it is there but not noticeable. Other than that, it is very good. I guess that is the reason for the small MOS gap relative to GT speech. Definitely the best GAN I have tried.

youssefavx commented 3 years ago

@jik876 regarding your earlier idea of using a noisy spectrogram as input: would that mean that HiFi-GAN, if fine-tuned on a speaker's clean audio, could denoise that same speaker's noisy audio?

@CookiePPP If you already made a Colab for this, would you be open to sharing? No worries if not. Your 44.1kHz results are very exciting and I'd be curious if you could share a model also, but again no worries if not.

youssefavx commented 3 years ago

I'm also curious, when fine-tuning on one speaker, how much training data do I need? Minutes? Hours? If so, how much would get me good results and then how much would guarantee best results?

CookiePPP commented 3 years ago

@youssefavx

I don't have a 44Khz notebook set up.

https://github.com/CookiePPP/VocoderComparisons/tree/main/repos/hifi-gan https://github.com/CookiePPP/cookietts/tree/experimental/CookieTTS/_4_mtw/hifigan

Code can be stolen from either of these. If you can't figure out what you need in, say, ~2 days, nag me and I can write up a notebook.

CookiePPP commented 3 years ago

your earlier idea on using a noisy spectrogram as input, would that mean that hi-fi gan if fine-tuned on a speaker's clean audio input could denoise that same speaker's noisy audio?

I can also confirm the vocoder is able to upsample audio in a way that improves the audio quality. Input 22Khz audio, convert it to a spectrogram, and output 44Khz; the 44Khz will sound better than the original audio in ~80% of cases. 😄

youssefavx commented 3 years ago

@CookiePPP Thanks so much! I'll try to figure this out soon. Very exciting to hear your results

jik876 commented 3 years ago

@jik876 regarding your earlier idea of using a noisy spectrogram as input: would that mean that HiFi-GAN, if fine-tuned on a speaker's clean audio, could denoise that same speaker's noisy audio?

@CookiePPP If you already made a Colab for this, would you be open to sharing? No worries if not. Your 44.1kHz results are very exciting and I'd be curious if you could share a model also, but again no worries if not.

@youssefavx

Thanks for your interest.

Yes, that's right. Furthermore, I conducted a simple speech-enhancement experiment with a noisy input condition and clean ground-truth audio, and HiFi-GAN removed the noise significantly without any other changes.

I'm also curious, when fine-tuning on one speaker, how much training data do I need? Minutes? Hours? If so, how much would get me good results and then how much would guarantee best results?

It's hard to comment on other datasets, but in our experiments with the LJSpeech dataset we fine-tuned the model for up to 100k steps with all the training data except the validation data and got good results. The MOS difference between ground truth and audio generated by the fine-tuned model is very small. More details can be found in our paper.

youssefavx commented 3 years ago

@jik876 That's amazing to hear, thank you!

youssefavx commented 3 years ago

@CookiePPP Did you manage to make a Colab for this? After going back to this, I only managed to run inference on Colab, but I am struggling to find how one generates mel spectrograms from Tacotron. Their README seems to only point to how to generate them from text, though I could be wrong.

It'd be more user-friendly if that detail were abstracted into "put all your wav files in this folder" and then taken care of automatically.

te0006 commented 3 years ago

Hi, thanks a lot for the reply 😀 @Edresson and I were able to adjust it for Mozilla TTS and it is now training.

Could you share how you achieved that "adjustment"? Would like to experiment with MozTTS+HiFiGAN as well. Thx!

kenna3 commented 3 years ago

Hi guys, I'm really trying to implement this with my Tacotron 2 locally on Windows but have no idea how. I just want to use it as a vocoder, as HiFi-GAN is the best.

sygi commented 2 years ago

Late to the party, but I also tried fine-tuning Tacotron 2 and later HiFi-GAN on another speaker (p270 from VCTK, so around 30 min of fine-tuning data), and I'm observing great quality except for the metallic sound during breathing (example). @george-roussos did you, by any chance, have to decrease the batch size while fine-tuning Tacotron 2? I was doing this to fit the model in the free Colab GPU memory, and I wonder if this could have caused the problem.

v-nhandt21 commented 2 years ago

@george-roussos How much data do you have then? I've got the universal 44Khz HiFi-GAN trained on 8x V100s with a massive set of speakers, with help from a couple of friends. It's not finished training and it's probably going to be used on my friend's website anyway, so I don't have permission to share, but I'm curious how it'd go with your dataset on this early checkpoint I have. This early checkpoint sounds perfect on almost every speaker we've tested, like literally can't tell from ground truth audio. And since the spectrogram fmax is 11025, we've tried upsampling 22khz files and it makes the generated output sound better than the input! 😄

I have some questions after comparing how the different repos extract mels:


    import librosa
    import numpy as np
    import torch

    # Assumed to be defined elsewhere to match the respective configs:
    # fft_size, hop_size, win_length, num_mels, sampling_rate, fmin, fmax,
    # max_wav_value, eps, melbasis, window_librosa, window_torch, device;
    # TacotronSTFT presumably comes from the Tacotron 2 repo's layers.py.

    def get_mel_parallelwavegan(wave):
        # get amplitude spectrogram
        wave = wave / max_wav_value
        wave = wave.astype('float32')
        x_stft = librosa.stft(wave, n_fft=fft_size, hop_length=hop_size, win_length=win_length, window=window_librosa, center=True, pad_mode="reflect")
        spc = np.abs(x_stft).T  # (#frames, #bins)
        mel = np.maximum(eps, np.dot(spc, melbasis.T))
        return np.log10(mel).T

    def get_mel_tacotron2(wave):
        wave = torch.FloatTensor(wave)
        audio_norm = wave / max_wav_value
        audio_norm = audio_norm.unsqueeze(0)
        audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)
        _stft = TacotronSTFT(fft_size, hop_size, fft_size, num_mels, sampling_rate, fmin, fmax)
        melspec = _stft.mel_spectrogram(audio_norm)
        melspec = torch.squeeze(melspec, 0)
        return melspec.cpu().detach().numpy()

    def get_mel_hifigan(y):
        y = y / max_wav_value
        y = torch.FloatTensor([y]).to(device)
        y = torch.nn.functional.pad(y.unsqueeze(1), (int((fft_size - hop_size) / 2), int((fft_size - hop_size) / 2)), mode='reflect').squeeze(1)
        spec = torch.stft(y, fft_size, hop_length=hop_size, win_length=win_length, window=window_torch, center=False, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
        spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
        mel_basis = torch.from_numpy(melbasis).float().to(device)
        spec = torch.matmul(mel_basis, spec)
        spec = torch.log(torch.clamp(spec, min=1e-5) * 1)
        return spec.cpu().detach().numpy()[0]

@jik876 I wonder why, in HiFi-GAN's meldataset.py, you don't use the default padding of torch.stft (center=True). Is it different? That is:

    y = torch.nn.functional.pad(y.unsqueeze(1), (int((fft_size - hop_size) / 2), int((fft_size - hop_size) / 2)), mode='reflect').squeeze(1)
    spec = torch.stft(y, fft_size, hop_length=hop_size, win_length=win_length, window=window_torch, center=False, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)

versus:

    spec = torch.stft(y, fft_size, hop_length=hop_size, win_length=win_length, window=window_torch, center=True, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
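
For what it's worth, the two schemes give different frame counts, which seems to be the point: padding (fft_size - hop_size) / 2 on each side with center=False yields exactly len(y) / hop_size frames, so the generator's hop_size-times upsampling reproduces the segment length exactly, while center=True pads fft_size / 2 and yields one extra frame. A quick check (sizes here are illustrative, not anyone's actual config):

    import torch

    n_fft, hop, win = 1024, 256, 1024
    segment = 8192                        # a multiple of hop, like a HiFi-GAN training segment
    y = torch.randn(1, segment)
    window = torch.hann_window(win)

    # default centering: torch pads n_fft // 2 on each side -> segment // hop + 1 frames
    spec_center = torch.stft(y, n_fft, hop_length=hop, win_length=win, window=window,
                             center=True, return_complex=True)

    # HiFi-GAN-style: pad (n_fft - hop) // 2 manually, then center=False -> segment // hop frames
    pad = (n_fft - hop) // 2
    y_pad = torch.nn.functional.pad(y.unsqueeze(1), (pad, pad), mode='reflect').squeeze(1)
    spec_manual = torch.stft(y_pad, n_fft, hop_length=hop, win_length=win, window=window,
                             center=False, return_complex=True)

    print(spec_center.shape[-1], spec_manual.shape[-1])   # 33 vs 32 frames in this example

The spectrogram content is essentially the same, just aligned slightly differently; the manual variant keeps audio length equal to hop_size times the number of frames, which is convenient when pairing audio segments with mels during training.
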
george-roussos commented 2 years ago

Hi all!

It has been some time since I last used HiFiGAN, but I will do my best to help with what I remember.

If I recall correctly, I never trained on multiple GPUs; it was just one at a time (which meant training a full model to completion took ca. 10 days). And since it was a V100, I kept my batch size at 16. Yes, it is true that smaller batch sizes make training harder, because they are probably not enough for the generator to learn, especially (I would guess) in the case of HiFi-GAN, which is trained jointly with the discriminator. Maybe you can try pre-training the generator for a few thousand steps.

I did no modifications when I finetuned on Taco2 predictions. I just continued from the latest checkpoint and provided my ground truth predictions.

I would think an fmax of 8000 should suffice, yes. It gives enough room to cover most voices. And for fmin might I suggest a value of 0, in order to deal with glottalisation.

It is true that spectrogram extraction differs between different TTS repos, but that should not have any effect on the ground-truth mels or the vocoder, as long as it is trained on the same specs as the TTS model (ground truth and predictions). Also, I think it all comes down to the same result if the values you use for preprocessing are the same, although I might of course be wrong.

Hope I helped!