mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

[Discussion] WaveGrad #518

Closed george-roussos closed 3 years ago

george-roussos commented 3 years ago

This is not an issue and is more of a discussion. I read the WaveGrad paper today (which may be found here) and listened to the samples here, which sound very good. There seems to be an open source implementation already here with great progress. Has anyone read the paper or used this implementation?

thorstenMueller commented 3 years ago

Thanks @george-roussos. Maybe I'll try with the default Mozilla config values (except the data path) just to ensure that my setup and dependencies are working correctly - not for creating a usable model.

thorstenMueller commented 3 years ago

I never received this, but I have only trained using the fork here: https://github.com/freds0/wavegrad. Then I took the spectrogram from TTS and passed it there.

Can you check if you are using the original LR etc. values?

Hey @george-roussos. I've decided to give the repo from freds0 a try. Our taco2 model uses a hop length of 256, but in wavegrad's params.py hop_samples is set to 300, with a hint not to change this value. Is this a problem?

https://github.com/lmnt-com/wavegrad/blob/f46ca43e9cfc0cb8113560b97aacf524d797f721/src/wavegrad/params.py#L43

lexkoro commented 3 years ago

Hey @george-roussos. I've decided to give the repo from freds0 a try. Our taco2 model uses a hop length of 256, but in wavegrad's params.py hop_samples is set to 300, with a hint not to change this value. Is this a problem?

https://github.com/lmnt-com/wavegrad/blob/f46ca43e9cfc0cb8113560b97aacf524d797f721/src/wavegrad/params.py#L43

You linked to the original repository.

freds0's fork has the needed adaptation to work with Mozilla's TTS: https://github.com/freds0/wavegrad/blob/master/src/wavegrad/params.py (only the stats_path is missing, I guess)
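
As a rough sanity check: the product of the upsampling factors has to equal the TTS model's hop length (the fork's params.py, quoted further down, states the same), which is why 300 becomes 256 here. A minimal sketch of that check:

# Minimal check, assuming WaveGrad upsamples each mel frame by the product of `factors`.
import numpy as np

factors = [4, 4, 4, 2, 2]   # freds0 fork values for hop_length = 256
hop_length = 256            # taco2 hop length from the TTS config
assert np.prod(factors) == hop_length, "upsampling factors must multiply to hop_length"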

thorstenMueller commented 3 years ago

Sorry @SanjaESC for the confusion. My fault - too late and too many open browser tabs ;-). Training is using the right repo. What confuses me is that TensorBoard doesn't produce audio samples (the only one shown has 0 seconds). Will audio samples be produced in further training steps?

(three TensorBoard screenshots, 2020-11-23)
george-roussos commented 3 years ago

I think that is normal. When I used it, it would only produce those 1-second audio samples.

thorstenMueller commented 3 years ago

Okay, thanks. I'll keep training running. Just one question: should I continue my questions/discussions on WaveGrad training progress on the "thorsten" dataset in Mozilla Discourse, so as not to overtax this issue thread?

thorstenMueller commented 3 years ago

I'm going crazy. I tried several runs with freds0's training, double-checked all config params (and adjusted them in freds0 if needed), and still have no success.

I tried these steps:

Now I want to check whether our taco2 model is compatible with the WaveGrad training:

And this gives the following error:

RuntimeError: Given groups=1, weight of size [768, 80, 3], expected input[1, 112, 80] to have 80 channels, but got 112 channels instead

Is this just a format error in the array order, or a configuration mismatch? I have no more ideas and I'm about to give up on WaveGrad.

george-roussos commented 3 years ago

Hi, sorry, this sounds frustrating. How do you extract the spectrogram from synthesize.py? I do it like this:

(screenshot of the spectrogram-saving code)
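
Roughly, the idea is to dump the model's postnet output to an .npy file - a sketch only, assuming the same np.save approach as the snippets further down this thread:

# Sketch (names assumed): save the Tacotron2 postnet output for WaveGrad to load later.
import numpy as np

spec = postnet_output[0].cpu().numpy()   # (frames, n_mels) from the TTS model
np.save("/tmp/spec.npy", spec.T)         # transpose to (n_mels, frames) for WaveGrad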

It always worked okay for me. My params.py looked like this:

params = AttrDict(
    # Training params
    batch_size=48,
    learning_rate=2e-4,
    max_grad_norm=1.0,

    # upsample factors
    #factors=[5, 5, 3, 2, 2], # 5*5*3*2*2=300 (hop_length=300)
    factors=[4, 4, 4, 2, 2], # 4*4*4*2*2=256 (this must equal hop_length)

    # Audio params
    num_mels=80,   
    fft_size=1024,     
    sample_rate=22050, 
    win_length=1024,  
    hop_length=256, # if you change that change factors

    frame_length_ms=None, 
    frame_shift_ms=None,  
    preemphasis=0.98,   
    min_level_db=-100,  
    ref_level_db=20,     
    power=1.5,           
    griffin_lim_iters=60,
    stft_pad_mode="reflect",
    signal_norm=True,    
    symmetric_norm=True, 
    max_norm=4.0,       
    clip_norm=True,  
    mel_fmin=0.0,      
    mel_fmax=8000.0,      
    spec_gain=20.0, 
    do_trim_silence=False,  
    trim_db=60,

    # Data params
    crop_mel_frames=24,
    # Model params
    noise_schedule=np.linspace(1e-6, 0.01, 1000).tolist(),
)

Some observations: I did not perform mean-var normalization, so I am not sure whether do_sound_norm should be set to True or not, even if you are doing mean-var training. Just try adding your stats_path and leave do_sound_norm set to False if it ends up not working. But all of the above assumes WaveGrad is not working correctly for you; if it is, then this is not relevant and your configuration is correct.

lexkoro commented 3 years ago

Is this just a format error in the array order, or a configuration mismatch?

I guess you are trying to load the .npy file created during preprocessing? I think it wasn't intended for inference, just for training; that's why you get the mismatch. You could use a transpose to match the expected order.

But the easiest way is to test it on synthesized spectrograms directly from TTS. @george-roussos linked an example above of how you can save one to an .npy file. Further above there is also example code to run it directly: https://github.com/mozilla/TTS/issues/518#issuecomment-696117318
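
A minimal sketch of that transpose fix (assumptions: the saved .npy holds a [frames, n_mels] array from synthesize.py, and WaveGrad's convolutions expect [n_mels, frames]):

# Sketch: load a saved spectrogram and put the mel bins on the channel axis.
import numpy as np
import torch

spec = np.load("/tmp/spec.npy")             # e.g. shape (112, 80): frames x mel bins
if spec.shape[-1] == 80:                    # mel bins last -> transpose to channels-first
    spec = spec.T                           # now (80, 112)
mel = torch.from_numpy(spec).unsqueeze(0)   # (1, 80, frames), what the conv layers expect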

thorstenMueller commented 3 years ago

@george-roussos :

Hi, sorry, this sounds frustrating.

Yes, a little, but I knew what I was getting into, so I'll keep trying. My code modifications look similar, as I received a comparable code snippet from @domcross too. I started working on a PR yesterday for saving spectrograms during synthesize.py. I'll check your ideas as soon as possible, thanks.

@SanjaESC :

I guess you are trying to load the .npy file created during preprocessing?

No, I saved the spectrograms in synthesize.py, called from the TTS server speaking a random phrase (no ground truth). So hopefully this is the right spectrogram for further processing through WaveGrad.

Even if it is complex, I don't plan to give up, so thanks for your great support, guys :-).

lexkoro commented 3 years ago

Could you share the code you are using to save it? The problem lies in the format you are exporting the spectrogram in.

thorstenMueller commented 3 years ago

around line 180 in synthesizer.py (dev branch):

            else:
                # use GL
                if self.use_cuda:
                    postnet_output = postnet_output[0].cpu()

                    # Save spectrogram for further processing in WaveGrad
                    # =========================
                    spec_filename = "spec-" + sen.replace(" ", "_")
                    np.save("/tmp/" + spec_filename + ".npy", postnet_output)
                    # =========================
                else:
                    postnet_output = postnet_output[0]
                postnet_output = postnet_output.numpy()
                wav = inv_spectrogram(postnet_output, self.ap, self.tts_config)

BTW, the tts function definition is the following:

    def tts(self, text, speaker_id=None):

lexkoro commented 3 years ago

Try this:

            # use GL
            if self.use_cuda:
                # Save spectrogram for further processing in WaveGrad
                # =========================
                spec_filename = "spec-" + sen.replace(" ", "_")
                np.save("/tmp/" + spec_filename + ".npy", postnet_output[0].T)
                # =========================

                postnet_output = postnet_output[0].cpu()
            else:
                postnet_output = postnet_output[0]
            postnet_output = postnet_output.numpy()
            wav = inv_spectrogram(postnet_output, self.ap, self.tts_config)

thorstenMueller commented 3 years ago

Thanks @SanjaESC. This leads to the following error:

can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I played around with adding .cpu(), etc. (found some tips by googling that error), without success. Would it help to try another vocoder than GL?

lexkoro commented 3 years ago

np.save("/tmp/" + spec_filename + ".npy", postnet_output[0].T.cpu()) should also work i guess.

Should I continue my questions/discussions on WaveGrad training progress on the "thorsten" dataset in Mozilla Discourse, so as not to overtax this issue thread?

I guess at this point it would be a good idea. ^^ You are welcome to write me a PM at discourse.

erogol commented 3 years ago

One simple trick to run WaveGrad even faster.

  1. Generate a rough waveform with Griffin-Lim
  2. Give it to WaveGrad and use it instead of random noise as input.

It is able to generate way higher quality in 6 iterations compared to the normal run. And you can even generate good quality in 3 iterations.

Vproject commented 3 years ago

Might the same be true if the output of, say, Multi-Band MelGAN were fed to WaveGrad?

thorstenMueller commented 3 years ago

I took code snippets from @SanjaESC and @domcross and put together this PR (https://github.com/mozilla/TTS/pull/580). Maybe it makes it easier for future beginners (like me) to work with the models' raw spectrograms. Thank you guys for your great support.

Shuxinz commented 3 years ago

One simple trick to run WaveGrad even faster.

  1. Generate a rough waveform with Griffin-Lim
  2. Give it to WaveGrad and use it instead of random noise as input.

It is able to generate way higher quality in 6 iterations compared to the normal run. And you can even generate good quality in 3 iterations.

Does the model need to be retrained, since the input is no longer random noise but a waveform?

erogol commented 3 years ago

Retraining is better, but it should also work otherwise.

thorstenMueller commented 3 years ago

With kind permission from @SanjaESC I've sent a PR to improve the robustness of synthesis with WaveGrad: https://github.com/mozilla/TTS/pull/602

@nmstoker you ran into the same issue I did (https://github.com/mozilla/TTS/issues/518#issuecomment-727254570):

AttributeError: 'NoneType' object has no attribute 'to'

I have no idea whether you still encounter this problem, but it should be addressed with this PR. Thanks @SanjaESC for your 🥇 first-class support and permission to pack your know-how into a PR.

ysujiang commented 3 years ago

One simple trick to run WaveGrad even faster.

  1. Generate a rough waveform with Griffin-Lim
  2. Give it to WaveGrad and use it instead of random noise as input.

It is able to generate way higher quality in 6 iterations compared to the normal run. And you can even generate good quality in 3 iterations.

@erogol hello, I did the experiment the way you suggested, but I find the Griffin-Lim waveform is shorter than the noise, and the missing size is hop_length. How do you deal with the Griffin-Lim waveform?
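
One possible way to handle that length mismatch (an assumption, not erogol's answer): pad or trim the Griffin-Lim waveform to the length WaveGrad expects, i.e. number of mel frames times hop_length.

# Sketch: align the GL waveform (a 1D tensor, names assumed) with the noise length WaveGrad would draw.
import torch.nn.functional as F

target_len = mel.shape[-1] * hop_length                           # frames * hop_length samples
if gl_wav.shape[-1] < target_len:
    gl_wav = F.pad(gl_wav, (0, target_len - gl_wav.shape[-1]))    # zero-pad the tail
else:
    gl_wav = gl_wav[..., :target_len]                             # or trim if too long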

mirfan899 commented 3 years ago

I am trying to use it with the model from https://discourse.mozilla.org/t/creating-a-github-page-for-hosting-community-trained-models/70889/11 and I'm getting the following error.

(TTS) irfan@gini:~/PycharmProjects/TTS$ python TTS/bin/synthesize.py --text "Text for TTS" --model_path /home/irfan/Downloads/ek/ek1_model.pth.tar --config_path /home/irfan/Downloads/ek/model_config.json --out_path speech.wav --vocoder_path /home/irfan/Downloads/ek/ek1_vocoder.pth.tar --vocoder_config_path /home/irfan/Downloads/ek/vocoder_config.json --use_cuda True
Traceback (most recent call last):
  File "TTS/bin/synthesize.py", line 13, in <module>
    from TTS.tts.utils.generic_utils import setup_model
  File "/home/irfan/PycharmProjects/TTS/TTS/tts/utils/generic_utils.py", line 7, in <module>
    from TTS.utils.generic_utils import check_argument
  File "/home/irfan/PycharmProjects/TTS/TTS/utils/generic_utils.py", line 6, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'

How can I fix this?

I'm using Python 3.6.9 on Ubuntu 18
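
For context, contextlib.nullcontext only exists in Python 3.7+, so on 3.6 either upgrading Python or a small backport along these lines should get past the import (an assumption, not a fix confirmed in this thread):

# Backport sketch: provide contextlib.nullcontext on Python 3.6 before TTS is imported.
import contextlib

if not hasattr(contextlib, "nullcontext"):
    @contextlib.contextmanager
    def nullcontext(enter_result=None):
        yield enter_result
    contextlib.nullcontext = nullcontext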

alexdemartos commented 3 years ago

Retraining is better, but it should also work otherwise.

This is a great idea. I'm sorry, I am not really experienced with these diffusion models, so I would like to ask for advice: how exactly would you train using Griffin-Lim as the noise input?

Would def compute_y_n(self, y_0): take an additional noise parameter with the Griffin-Lim waveform instead of using noise = torch.randn_like(y_0)? Would this be sufficient?

If so, how should the inference method work? Will z = torch.randn_like(y_n) always be the Griffin-Lim waveform? I am a bit lost.

Thanks for your help.

erogol commented 3 years ago

It is the latter. Basically, you don't start off with random noise but with the GL output to diffuse.
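
A minimal sketch of that idea (assumptions: a standard WaveGrad-style reverse loop where model predicts the noise from the current waveform, the conditioning mel and the noise level; the names are illustrative, not the repo's API):

# Sketch: reverse diffusion that starts from the Griffin-Lim estimate instead of noise.
import torch

def infer_from_gl(model, mel, gl_wav, noise_schedule):
    beta = torch.tensor(noise_schedule)
    alpha = 1.0 - beta
    alpha_cum = torch.cumprod(alpha, dim=0)

    y = gl_wav.clone()                                    # GL waveform replaces torch.randn_like(...)
    for t in reversed(range(len(beta))):
        c1 = 1.0 / alpha[t].sqrt()
        c2 = beta[t] / (1.0 - alpha_cum[t]).sqrt()
        eps = model(y, mel, alpha_cum[t].sqrt())          # predicted noise
        y = c1 * (y - c2 * eps)                           # denoising step
        if t > 0:
            y = y + beta[t].sqrt() * torch.randn_like(y)  # keep some stochasticity
    return y.clamp(-1.0, 1.0)

With the short 6-step (or even 3-step) schedule mentioned above, noise_schedule would just be a correspondingly short list of betas.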

BTW I don't maintain this code anymore. Come to :frog: TTS

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts