lucidrains / voicebox-pytorch

Implementation of Voicebox, the new SOTA text-to-speech network from Meta AI, in PyTorch
MIT License

Deep network does not converge #23

Closed lixuyuan102 closed 11 months ago

lixuyuan102 commented 11 months ago

Hello, I am trying to train this network. I find that it does not converge when the depth is set to 24, although it trains well when the depth is set to 12. Has anyone else run into this problem?

lucidrains commented 11 months ago

@lixuyuan102 you are just experiencing normal transformer instability (you can expect any transformer > 16 layers to be difficult to train, but obviously with high rewards if you do train it..)

i'll flesh out the training code soon (with a bunch of good default settings)

lucidrains commented 11 months ago

@lixuyuan102 what kind of results are you seeing with 12 layers? did you fix the issue where the mel decoder is not an exact inverse of the encoder?

lixuyuan102 commented 11 months ago

@lixuyuan102 what kind of results are you seeing with 12 layers? did you fix the issue where the mel decoder is not an exact inverse of the encoder?

To be honest, I first trained the voicebox with a U-Net and got acceptable results. I then switched to the 24-layer transformer mentioned in the paper, but the loss showed no tendency to decrease even after 30,000 training iterations. However, when I changed the depth to 12, the loss plummeted to about 30% of its initial value within 200-300 iterations, just as it did with the U-Net.

lixuyuan102 commented 11 months ago

@lixuyuan102 what kind of results are you seeing with 12 layers? did you fix the issue where the mel decoder is not an exact inverse of the encoder?

BTW, I do not understand the "inverse" issue. The network I implemented myself is an encoder-only Transformer with skip connections, and it works with 12 layers, so it probably does not have any fatal errors.

lucidrains commented 11 months ago

@lixuyuan102 hey, would you like to retry 24 or even 16 layers? i added an intervention that should lead to more stable training

also, try using the Trainer class that Lucas (who has already successfully trained a small voicebox) used. it includes gradient clipping, warmup, and all the standard practices for training transformers.
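
for reference, the standard recipe looks roughly like the sketch below. this is generic pytorch, not the actual Trainer class in this repo; `make_optimizer_and_scheduler` and `training_step` are just illustrative names, and `model(batch)` is assumed to return a scalar loss.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_scheduler(model, lr = 1e-4, warmup_steps = 1000):
    # AdamW with mild weight decay is a common default for transformers
    optimizer = AdamW(model.parameters(), lr = lr, betas = (0.9, 0.99), weight_decay = 0.01)
    # linear warmup from ~0 to the target lr, then constant
    scheduler = LambdaLR(optimizer, lr_lambda = lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler

def training_step(model, batch, optimizer, scheduler, max_grad_norm = 0.5):
    loss = model(batch)                       # assumes the model returns a scalar loss
    loss.backward()
    # gradient clipping keeps rare large gradients from destabilizing deep transformers
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```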

lucidrains commented 11 months ago

@lixuyuan102 i also turned off the skip connections by default, as i am skeptical of that proposal in the paper. (no ablations, studies on how it affects transformer training, etc)
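
for anyone following along, the skip connections in question look roughly like the sketch below (u-vit style: outputs of the first half of the layers are concatenated into the mirrored second half and projected back down). this is illustrative, not the exact code in this repo; `block_fn` is a placeholder for any standard attention + feedforward block.

```python
import torch
from torch import nn

class UnetSkipTransformer(nn.Module):
    def __init__(self, dim, depth, block_fn):
        super().__init__()
        assert depth % 2 == 0, 'depth must be even so the layers can be mirrored'
        self.layers = nn.ModuleList([block_fn(dim) for _ in range(depth)])
        # one projection per second-half layer, merging the concatenated skip back to `dim`
        self.skip_projs = nn.ModuleList([nn.Linear(dim * 2, dim) for _ in range(depth // 2)])

    def forward(self, x):
        depth = len(self.layers)
        skips = []
        for i, layer in enumerate(self.layers):
            if i >= depth // 2:
                skip = skips.pop()            # mirror: last saved output is consumed first
                x = self.skip_projs[i - depth // 2](torch.cat((x, skip), dim = -1))
            x = layer(x)
            if i < depth // 2:
                skips.append(x)               # save first-half outputs for the second half
        return x
```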

lucasnewman commented 11 months ago

@lucidrains I tried a 24-layer network with qk norm enabled and it converges now 🚀
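
For anyone curious, qk norm amounts to l2-normalizing the queries and keys before the dot product so the attention logits stay bounded. An illustrative sketch, not the exact attention module in this repo:

```python
import torch
import torch.nn.functional as F
from torch import nn

class QKNormAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, scale = 10.0):
        super().__init__()
        inner = heads * dim_head
        self.heads = heads
        self.scale = scale                                    # fixed temperature on the bounded logits
        self.to_qkv = nn.Linear(dim, inner * 3, bias = False)
        self.to_out = nn.Linear(inner, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        q, k = F.normalize(q, dim = -1), F.normalize(k, dim = -1)  # qk norm: unit-length q and k
        sim = (q @ k.transpose(-2, -1)) * self.scale          # cosine-similarity logits, bounded by scale
        attn = sim.softmax(dim = -1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```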

lixuyuan102 commented 11 months ago

@lixuyuan102 i also turned off the skip connections by default, as i am skeptical of that proposal in the paper. (no ablations, studies on how it affects transformer training, etc)

Thanks for your reply. I used adaLN instead of the normal residual connections in the Transformer and successfully trained deeper networks. I have also observed that U-Net skip connections help deeper networks converge earlier.
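
Roughly what I mean by adaLN, in the style of DiT (an illustrative sketch, not my exact code): the conditioning vector, e.g. the flow-matching time embedding, predicts a scale/shift for the normalized activations and a gate on the residual branch.

```python
import torch
from torch import nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim, cond_dim, block):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine = False)
        self.block = block                                    # an attention or feedforward sub-block
        self.to_mod = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, dim * 3))
        nn.init.zeros_(self.to_mod[-1].weight)                # zero init so the block starts as an identity
        nn.init.zeros_(self.to_mod[-1].bias)

    def forward(self, x, cond):
        # x: (batch, seq, dim), cond: (batch, cond_dim), e.g. a time embedding
        scale, shift, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim = -1)
        h = self.norm(x) * (1 + scale) + shift                # adaptive layernorm modulation
        return x + gate * self.block(h)                       # gated residual branch
```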

lucidrains commented 11 months ago

@lixuyuan102 great to hear! closing the issue

please share some of your side-by-side experiments on the unet skip connection; maybe if we dig a bit deeper, it could be applicable to language modeling? however, i am still skeptical until it is well presented rather than just anecdata

blldd commented 11 months ago

Thanks for your reply. I used adaLN instead of the normal residual connections in the Transformer and successfully trained deeper networks. I have also observed that U-Net skip connections help deeper networks converge earlier.

Hi Xuyuan, I am also interested in the convergence problem. Could you share your training script? I would like to follow your data preparation and model training to gain a deeper understanding of the problem. Thanks in advance.

lixuyuan102 commented 11 months ago

@lucidrains Most notably, in my tests the U-Net skip connections roughly double the depth at which a model converges. For example, an 8-layer network converges without the U-Net skip, but a 16-layer network only converges after adding it. Due to time constraints, I have not yet run ablations on final generation quality. It is worth mentioning that I used mel features as the training target rather than features learned by the vocoder; in my experiments the former requires more convolutional positional coding to produce good results. This suggests that the type of intermediate speech features may also affect training. I will follow up with related ablation experiments. If you do further research on this for large language models in the future, I'd appreciate you sharing the results!
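
By convolutional positional coding I mean something along these lines (an illustrative sketch, not my exact code): a depthwise 1D convolution over the time axis whose output is added back to the hidden states, giving the mel frames relative position information.

```python
import torch
from torch import nn

class ConvPositionEmbed(nn.Module):
    def __init__(self, dim, kernel_size = 31):
        super().__init__()
        assert kernel_size % 2 == 1, 'use an odd kernel size so padding preserves sequence length'
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding = kernel_size // 2, groups = dim),  # depthwise over time
            nn.GELU()
        )

    def forward(self, x):
        # x: (batch, seq, dim) of mel-frame hidden states
        pos = self.conv(x.transpose(1, 2)).transpose(1, 2)    # convolve along the time axis
        return x + pos                                        # add relative positional information
```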

lixuyuan102 commented 11 months ago

Hi Xuyuan, I am also interested in the convergence problem. Could you share your training script? I would like to follow your data preparation and model training to gain a deeper understanding of the problem. Thanks in advance.

I'd love to share my code, but I'm preparing for my graduate mid-term defense. I will organize the code and share it on GitHub, or send it to your email, within a week.

lucidrains commented 11 months ago

ok, that sounds encouraging. I originally made it optional because I thought it was the source of the instability, but now that I know you trained it successfully with it on, I'll give it a try with language modeling. will share what I see here

blldd commented 11 months ago

I'd love to share my code, but I'm preparing for my graduate mid-term defense. I will organize the code and share it on GitHub, or send it to your email, within a week.

Thanks a lot, my email is ddlecnu@gmail.com. Good luck with your defense.