Many0therFunctions opened this issue 5 months ago
The output of Hifigan is a waveform.
I know. I was analyzing the wav files in a spectral view to check. Something is very wrong there...
Same issue here.
Training run spectrograms and samples from trained weights attached:
Please describe your training dataset.
I'm using ~1200 Hindi (language_code = "hi") speech samples from the google/fleurs dataset. The language is already part of the original xttsv2 checkpoint.
I generated latents using generate_latents.py and the samples in /synthesis are decent; I can understand them. The training loss went down, came back up within about 20 epochs, and stayed there for the rest of the training.
I get the same static, noisy audio results from the 20th epoch onwards.
In my experiments, if it's not English, the amount of data needed to fine-tune is quite large. I think 1200 samples is a bit small. With Vietnamese, I had to use nearly 100 hours to get good results.
I see, that's interesting. I assumed it wouldn't require much data because the hifigan decoder from the original checkpoint was already trained on it. I'll experiment and let you know! Thank you :)
If I didn't know better, I'd almost think it would be a better use of time and resources to train an AI to convert GPT latents to their corresponding encodec tokens, like Bark does, and then feed those into a vocoder that uses encodec...
- mumbles about just wanting a simple fix so xttsv2 can handle screams and yells, but it seems we have to do the roundabout thing like with Bark, redefining which token maps to which sound... -
That's interesting! I added multiple new languages to xtts with decent speech output using a couple of hundred hours of speech per language, at the cost of some performance on the original languages.
I didn't think it'd be that complex to add screams/yells and other custom sounds.
The reason I'm looking to train hifigan is to get human-quality audio, and I'm not sure where to go from here if this fails. Audio super-resolution techniques have all failed for me.
I'm really disappointed, because in Bark it was child's play to get such things. It's just that Bark was 1. WAY too slow, and 2. too unpredictable, which wouldn't be such an issue if it weren't so God-Pounding SLOW.
Yes, the Bark samples on their demo page seem to be cherry-picked. The inconsistency in audio generation doesn't work for my use case. Agreed, it's very, very slow.
Just an FYI, @tuanh123789 is likely using a trained xtts checkpoint, so the model state dict keys are prefixed with xtts, like xtts.hifigan_decoder.waveform_decoder. However, the checkpoint released by Coqui on Hugging Face has no xtts prefix in the key names. I had to make a few more changes to load the original checkpoint correctly.
However, after all this, the same issue happened. Train loss starts around 110, drops to approximately 70, then goes back up to 78-81 and stabilises there. Eval audio goes from xtts-level speech at iteration 0, to static noise, to barely legible speech with loud static noise.
Using 300 hours of English samples.
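For illustration, here is a minimal sketch of that key remapping; the file name, the exact prefix, and the commented-out training_model object are assumptions taken from this thread, not the repo's actual loader, so adjust them to whatever the training code really builds:

```python
import torch

# The Coqui checkpoint on Hugging Face stores keys like
# "hifigan_decoder.waveform_decoder.conv_pre.weight", while a checkpoint trained
# with this repo wraps the whole model and prefixes every key with "xtts.".
hf_state = torch.load("model.pth", map_location="cpu")
hf_state = hf_state.get("model", hf_state)  # some checkpoints nest weights under "model"

# The remapping is just a rename:
remapped = {f"xtts.{k}": v for k, v in hf_state.items()}

decoder_keys = [k for k in remapped if "hifigan_decoder.waveform_decoder." in k]
print(f"{len(decoder_keys)} waveform-decoder tensors ready to load")

# Load with strict=True so any key that still doesn't line up fails loudly;
# with strict=False a total mismatch passes silently and the hifigan generator
# stays randomly initialised, i.e. it trains from scratch without you noticing.
# training_model.load_state_dict(remapped, strict=True)
```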
@Abhinay1997 Hi, I hit exactly the same issue as you. I'm training two languages (en and my language) together, and the spectrograms look the same as yours. But the result is not good, especially for my language, which seems to have failed; en seems fine. 200+ hours of data.
@hscspring, did you modify train.py to load the huggingface xtts checkpoint, or are you using your trained checkpoint? I'm trying to see what the issue could be.
@Abhinay1997 Actually, you can use either the original or your finetuned checkpoint, because the hifigan checkpoints are the same. Maybe you can also finetune the speaker encoder together with it.
By the way, I have a similar issue to the one you've described, and I still don't know why.
@hscspring, true, the state_dict has the same values, but the keys are different. You can test this by trying to load the huggingface checkpoint with strict=True here. Without remapping the keys, the weights silently fail to load, so you would actually be training the hifigan from scratch if you use the original checkpoint.
As for the other issue, I'm still checking. I'll do a training run on the weekend.
@Abhinay1997 I modified the code. strict=True is always a good habit.
I just found another issue (it's my own problem, I modified the architecture of xtts).
Now waiting for the new result~
I have a heavy suspicion why. It's that this really is impossible without the official discriminator network, and there's no way to just regenerate the discriminator from only the generator weights... I don't know.
Good catch there though. Definitely overlooked that.
(The training dataset here IS English, so it should have been trivial to fine-tune, and yet it behaves more like training completely from scratch, which I definitely don't have the compute resources for. I really hope this training isn't supposed to be some Monte Carlo statistical reverse-engineering of a discriminator network, because that WILL require VAST amounts of compute and storage to be robust, especially with some of the more effective optimizers.)
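To make the discriminator concern concrete: in HiFi-GAN-style training, both the adversarial and feature-matching parts of the generator loss are computed from discriminator outputs, so with a freshly initialised discriminator even a pretrained generator gets a training signal much like starting from scratch. A minimal, generic sketch of that loss structure (the tiny discriminator below is a hypothetical stand-in, not the official multi-period/multi-scale discriminators):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveDiscriminator(nn.Module):
    """Hypothetical stand-in for HiFi-GAN's multi-scale/multi-period discriminators."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(1, 16, 15, stride=4, padding=7),
            nn.Conv1d(16, 64, 41, stride=4, padding=20),
            nn.Conv1d(64, 1, 3, padding=1),
        ])

    def forward(self, x):
        feats = []
        for layer in self.layers[:-1]:
            x = F.leaky_relu(layer(x), 0.1)
            feats.append(x)
        return self.layers[-1](x), feats  # score map + intermediate features

disc = TinyWaveDiscriminator()

real = torch.randn(2, 1, 8192)  # real audio batch
fake = torch.randn(2, 1, 8192)  # generator output (placeholder)

# Discriminator step (LSGAN-style, as in HiFi-GAN)
d_real, _ = disc(real)
d_fake, _ = disc(fake.detach())
d_loss = torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)

# Generator step: adversarial + feature-matching losses both depend on the
# discriminator. With an untrained discriminator these terms carry almost no
# useful gradient, even if the generator weights are pretrained.
d_fake, fake_feats = disc(fake)
_, real_feats = disc(real)
g_adv = torch.mean((d_fake - 1) ** 2)
g_fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
```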
Modify this in hifigan_decoder.py:
- resblock_type,
- resblock_dilation_sizes,
- resblock_kernel_sizes,
- upsample_kernel_sizes,
- upsample_initial_channel,
- upsample_factors,
- inference_padding=5,
- cond_channels=0,
- conv_pre_weight_norm=True,
- conv_post_weight_norm=True,
- conv_post_bias=True,
- cond_in_each_up_layer=False,
+ resblock_type, # "1"
+ resblock_dilation_sizes, # [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+ resblock_kernel_sizes, # [3, 7, 11]
+ upsample_kernel_sizes, # [16, 16, 4, 4]
+ upsample_initial_channel, # 512
+ upsample_factors, # [8, 8, 2, 2]
+ inference_padding=0,
+ cond_channels=512,
+ conv_pre_weight_norm=False,
+ conv_post_weight_norm=False,
+ conv_post_bias=False,
+ cond_in_each_up_layer=True,
and unsqueeze z in gpt_gan.py:
z = batch["speaker_embedding"]
+ z = z.unsqueeze(-1)
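For context, a quick shape check of why that unsqueeze is needed (layer sizes assumed from the cond_channels=512 setting above):

```python
import torch

# The speaker embedding comes out of the batch as [batch, 512], but the
# generator's conditioning layer is a Conv1d, which expects a 3D tensor
# shaped [batch, channels, time]. A trailing time axis of length 1 lets the
# conditioning broadcast over the upsampled feature map.
z = torch.randn(4, 512)     # batch of speaker embeddings
z = z.unsqueeze(-1)         # -> [4, 512, 1]

cond_layer = torch.nn.Conv1d(512, 512, kernel_size=1)
print(cond_layer(z).shape)  # torch.Size([4, 512, 1])
```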
@hscspring Thank you for confirming these! I made the same changes in hifigan_decoder.py but wasn't sure if I had messed something up. I'll have to compare the change in gpt_gan.py, as I remember using a transpose to pass the batch.
@Many0therFunctions that's a very valid point. Probably also why it requires so much data to train in the first place.
Maybe set
+ conv_pre_weight_norm=True,
+ conv_post_weight_norm=True,
+ conv_post_bias=True,
when training (meaning don't call remove_parametrizations, i.e. keep the weight norm during training).
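A generic PyTorch sketch of that idea: keep the weight-norm parametrization during training and only strip it for inference/export. The layer sizes here are illustrative, and the parametrizations API needs a recent PyTorch (2.1+):

```python
import torch
import torch.nn as nn
from torch.nn.utils import parametrize
from torch.nn.utils.parametrizations import weight_norm

# Training: keep the weight-norm parametrization on the conv, so the optimizer
# updates the (weight_g, weight_v) decomposition the checkpoint was trained with.
conv_pre = weight_norm(nn.Conv1d(1024, 512, kernel_size=7, padding=3))

x = torch.randn(2, 1024, 50)
y = conv_pre(x)  # [2, 512, 50]

# Inference/export only: bake the parametrization back into a plain weight
# tensor; this is the state that conv_pre_weight_norm=False expects at load time.
parametrize.remove_parametrizations(conv_pre, "weight")
```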
generator_model_params={
    "cond_channels": 512,
    "cond_in_each_up_layer": True,
    "conv_pre_weight_norm": False,
    "conv_post_weight_norm": False,
    "upsample_factors": [8, 8, 2, 2],
    "upsample_kernel_sizes": [16, 16, 4, 4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "resblock_type": "1",
    "conv_post_bias": False,
    "inference_padding": 0,
}
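As a sanity check of those parameters, a sketch that instantiates the generator directly with them; the import path and the 1024-channel input / 1-channel output are assumptions based on Coqui's XTTS-v2 decoder, so verify them against the repo you're actually training with:

```python
import torch
from TTS.tts.layers.xtts.hifigan_decoder import HifiganGenerator

# Assumed: XTTS feeds 1024-dim GPT latents in and produces mono audio out.
generator = HifiganGenerator(
    1024, 1,
    resblock_type="1",
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    resblock_kernel_sizes=[3, 7, 11],
    upsample_kernel_sizes=[16, 16, 4, 4],
    upsample_initial_channel=512,
    upsample_factors=[8, 8, 2, 2],
    inference_padding=0,
    cond_channels=512,
    conv_pre_weight_norm=False,
    conv_post_weight_norm=False,
    conv_post_bias=False,
    cond_in_each_up_layer=True,
)

latents = torch.randn(1, 1024, 50)  # [batch, channels, frames]
speaker = torch.randn(1, 512, 1)    # speaker embedding after unsqueeze(-1)
wav = generator(latents, g=speaker)
print(wav.shape)                    # upsample factors 8*8*2*2 = 256x -> [1, 1, 12800]
```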
@hscspring I was using the same parameters for an English dataset, with the lr and everything else from the repo, including the gpt_gan unsqueeze.
I had an oscillating loss. I haven't tried again with:
conv_pre_weight_norm=True, conv_post_weight_norm=True, conv_post_bias=True,
All trained hifigan models come out sounding like this. It just generates straight mel spectrogram bands.
https://vocaroo.com/18YQzfRyOJMV