tuanh123789 / Train_Hifigan_XTTS

This is an implementation for training the HiFiGAN part of the XTTSv2 model using Coqui/TTS.

Trained model does not generate any speech at all #3

Open Many0therFunctions opened 2 months ago

Many0therFunctions commented 2 months ago

All trained HiFiGAN models come out sounding like this. The output is just straight mel-spectrogram bands.

[attached: spectrogram screenshot of the generated audio]

https://vocaroo.com/18YQzfRyOJMV

tuanh123789 commented 1 month ago

The output of HiFiGAN is a waveform.

Many0therFunctions commented 1 month ago

I know. I was analyzing the WAV files in a spectral view. Something is very wrong there...

Abhinay1997 commented 1 month ago

Same issue here.

Training run spectrograms and samples from the trained weights are attached (speech_comparison / fake):

https://voca.ro/19j9UzebhNzw

tuanh123789 commented 1 month ago

Please describe your training dataset

Abhinay1997 commented 1 month ago

I'm using ~1200 Hindi (language_code = "hi") speech samples from the google/fleurs dataset. The language is already part of the original XTTSv2 checkpoint.

I generated latents using generate_latents.py, and the samples in /synthesis are decent; I can understand them. The training loss went down, came back up within about 20 epochs, and stayed there for the rest of the training.

The same static, noisy audio results from the 20th epoch onwards.

tuanh123789 commented 1 month ago

In my experiments, if the language is not English, the amount of data needed to fine-tune is quite large. I think 1200 samples is a bit small. With Vietnamese, I had to use nearly 100 hours to get good results.

Abhinay1997 commented 1 month ago

I see, that's interesting. I assumed it wouldn't require much data because the HiFiGAN decoder from the original checkpoint was already trained on it. I'll experiment and let you know! Thank you :)

Many0therFunctions commented 1 month ago

If I didn't know better, I'd almost think it would be a better use of time and resources to train a model to convert GPT latents to their corresponding EnCodec tokens, as Bark does, and then feed those into an EnCodec-based vocoder...

- mumbles about just wanting a simple fix so XTTSv2 can handle screams and yells, but it seems we have to do the roundabout thing, like with Bark, redefining which token maps to which sound -

Abhinay1997 commented 1 month ago

That's interesting! I added multiple new languages to XTTS with decent speech output, using a couple of hundred hours of speech for each language, while sacrificing performance on the original languages.

I didn't think it'd be that complex to add screams/yells and other custom sounds.

The reason I'm looking to train HiFiGAN is to get human-quality audio, and I'm not sure where to go from here if this fails. Audio super-resolution techniques have all failed for me.

Many0therFunctions commented 1 month ago

I'm really disappointed because in Bark it was child's play to get such things. It's just that Bark was 1. WAY too slow, and 2. too unpredictable, which wouldn't be such an issue if it weren't so God-Pounding SLOW.

Abhinay1997 commented 1 month ago

Yes, the Bark samples on their demo page seem to be cherry-picked. The inconsistency in audio generation doesn't work for my use case. Agreed, it's very, very slow.

Abhinay1997 commented 1 month ago

Just an FYI, @tuanh123789 is likely using a trained XTTS checkpoint, so the model state dict keys are prefixed with xtts, like xtts.hifigan_decoder.waveform_decoder. However, the checkpoint on Hugging Face released by Coqui has no xtts in the key names. I had to make a few more changes to load the original checkpoint correctly.
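For reference, a minimal sketch of the kind of key remapping this required (the checkpoint filename, the "model" entry, and the exact prefix handling are assumptions; adapt it to your train.py):

import torch

# Minimal sketch, not this repo's exact code. Assumption: the Coqui checkpoint is a
# dict with a "model" entry, and the training code expects an "xtts." prefix on every key.
ckpt = torch.load("model.pth", map_location="cpu")["model"]
remapped = {"xtts." + key: value for key, value in ckpt.items()}

# model.load_state_dict(remapped, strict=True)  # strict=True surfaces any remaining mismatch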

However, after all this, the same issue happened. The train loss starts around 110, drops to approximately 70, climbs back to 78-81, and stabilises there. The eval audio goes from XTTS-level speech at iteration 0 to static noise, then to barely legible speech with loud static noise.

Using 300 hours of English samples.

hscspring commented 1 month ago

@Abhinay1997 Hi, I got exactly the same issue as you. I'm training two languages (en and my language) together, and the spectrograms look the same as yours. But the result is not so good, especially for my language; it seems to have failed, while en seems fine. 200+ hours of data.

Abhinay1997 commented 1 month ago

@hscspring, did you modify train.py to load the Hugging Face XTTS checkpoint, or are you using your own trained checkpoint? I'm trying to see what the issue could be.

hscspring commented 1 month ago

@Abhinay1997 Actually, you can use either the original or your fine-tuned checkpoint, because the HiFiGAN weights are the same. Maybe you can also fine-tune the speaker encoder together with it.

By the way, I have the same issue you've hit, and I still don't know why.

Abhinay1997 commented 1 month ago

@hscspring True, while the state_dict has the same values, the keys are different. You can test this by trying to load the Hugging Face checkpoint with strict=True here; if you use the original checkpoint as-is, you are actually training the HiFiGAN from scratch.
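A small hypothetical helper (the name is a placeholder, not this repo's API) makes that check concrete:

import torch

def check_checkpoint_keys(model: torch.nn.Module, state_dict: dict) -> None:
    # With strict=False the load silently skips mismatched keys, which is exactly how
    # "accidentally training from scratch" sneaks in; print the mismatch instead.
    result = model.load_state_dict(state_dict, strict=False)
    print(f"missing: {len(result.missing_keys)}, unexpected: {len(result.unexpected_keys)}")
    # With strict=True the same mismatch raises a RuntimeError instead of loading silently.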

As for the other issue, I'm still checking. I'll do a training run over the weekend.

hscspring commented 1 month ago

@Abhinay1997 I modified the code. strict=True is always a good habit. I just found another issue (it's my own problem; I had modified the architecture of XTTS). Now waiting for the new result~

Many0therFunctions commented 1 month ago

I have a strong suspicion why: this may simply be impossible without the official discriminator network, and there's no way to regenerate the discriminator from the generator weights alone... I don't know.

Good catch there though. Definitely overlooked that.

(The training dataset here IS English, so it should have been trivial to fine-tune, yet it behaves more like training completely from scratch, which I definitely don't have the compute resources for. I really hope this training isn't supposed to be some Monte Carlo statistical reverse-engineering of a discriminator network, because that WILL require vast amounts of compute and storage to be robust, especially with some of the more effective optimizers.)
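One way to see this concretely (a rough sketch; the filename and key layout are assumptions) is to look for discriminator weights in the released checkpoint:

import torch

# The released XTTSv2 checkpoint ships generator/decoder weights only, so any GAN
# discriminator has to start from a random init when fine-tuning.
state = torch.load("model.pth", map_location="cpu")["model"]
disc_keys = [key for key in state if "disc" in key.lower()]
print(disc_keys)  # empty list -> no discriminator weights to resume from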

hscspring commented 1 month ago

modify this in hifigan_decoder.py:

-        resblock_type,
-        resblock_dilation_sizes,
-        resblock_kernel_sizes,
-        upsample_kernel_sizes,
-        upsample_initial_channel,
-        upsample_factors,
-        inference_padding=5,
-        cond_channels=0,
-        conv_pre_weight_norm=True,
-        conv_post_weight_norm=True,
-        conv_post_bias=True,
-        cond_in_each_up_layer=False,
+        resblock_type, # "1"
+        resblock_dilation_sizes, # [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+        resblock_kernel_sizes, # [3, 7, 11]
+        upsample_kernel_sizes, # [16, 16, 4, 4]
+        upsample_initial_channel, # 512
+        upsample_factors, # [8, 8, 2, 2]
+        inference_padding=0,
+        cond_channels=512,
+        conv_pre_weight_norm=False,
+        conv_post_weight_norm=False,
+        conv_post_bias=False,
+        cond_in_each_up_layer=True,

and unsqueeze z in gpt_gan.py:

             z = batch["speaker_embedding"]
+        z = z.unsqueeze(-1)
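For context, a minimal sketch of why the unsqueeze is needed, assuming the speaker embedding arrives from the batch as [B, 512]: the conditioning layer is a 1-D convolution, so it expects a trailing time dimension.

import torch

z = torch.randn(4, 512)                    # [batch, cond_channels], as taken from the batch
cond_layer = torch.nn.Conv1d(512, 512, 1)  # cond_channels -> upsample_initial_channel

z = z.unsqueeze(-1)                        # -> [4, 512, 1], a single conditioning "frame"
out = cond_layer(z)                        # works; without the unsqueeze the conv rejects the input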


Abhinay1997 commented 1 month ago

@hscspring Thank you for confirming these! I made the same changes in hifigan_decoder.py but wasn't sure if I had messed something up. I'll have to compare the change in gpt_gan.py, as I remember using a transpose to pass the batch.

Abhinay1997 commented 1 month ago

@Many0therFunctions That's a very valid point. It's probably also why it requires so much data to train in the first place.

hscspring commented 1 month ago

Maybe use:

 +        conv_pre_weight_norm=True,
+        conv_post_weight_norm=True,
+        conv_post_bias=True,

when training (meaning: do not remove_parametrizations during training).
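In PyTorch terms, a rough sketch of that idea (not the repo's exact code): keep the weight-norm parametrization attached during training, and only strip it for inference or export, which is what the conv_*_weight_norm=False defaults effectively assume has already happened.

import torch
from torch.nn.utils.parametrizations import weight_norm
from torch.nn.utils.parametrize import remove_parametrizations

conv_post = weight_norm(torch.nn.Conv1d(32, 1, kernel_size=7, padding=3))

# ... train with the parametrization attached ...

remove_parametrizations(conv_post, "weight")  # inference/export time only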

Abhinay1997 commented 1 month ago
generator_model_params={
            "cond_channels":512,
            "cond_in_each_up_layer":True,
            "conv_pre_weight_norm":False, 
            "conv_post_weight_norm":False,
            "upsample_factors": [8, 8, 2, 2],
            "upsample_kernel_sizes": [16, 16, 4, 4],
            "upsample_initial_channel": 512,
            "resblock_kernel_sizes": [3, 7, 11],
            "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
            "resblock_type": "1",
            "conv_post_bias":False,
            "inference_padding": 0,
        }

@hscspring I was using the same parameters for an English dataset, with the learning rate and everything else from the repo, including the gpt_gan unsqueeze.

[attached: two training-loss screenshots from 2024-06-29]

I had an oscillating loss. I haven't tried again with:

conv_pre_weight_norm=True, conv_post_weight_norm=True, conv_post_bias=True,