p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License

Training stuck #71

Open Madhavan0123 opened 8 months ago

Madhavan0123 commented 8 months ago

Hello,

Thanks for all the effort to create this repo. When I launch training, it runs for a few steps and then I see no progress at all. It's just stuck for a long time and still hasn't progressed.

INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/G_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/D_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/DUR_0.pth
Loading train data: 4%|████████████▍

Have you encountered this before? Any help would be extremely appreciated.

p0p4k commented 8 months ago

Hi, temporarily turn off the duration discriminator and tell me if it works.
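
For anyone trying this, a minimal sketch of the switch, assuming the duration discriminator is toggled by a flag in the model config (the key name use_duration_discriminator and the config path below are assumptions; check your own configs/*.json for the actual switch):

```python
# Sketch: turn off the duration discriminator in the training config.
# Key name and path are assumptions; adapt to the repo's actual config layout.
import json

cfg_path = "configs/vits2_ljs_base.json"   # illustrative path
with open(cfg_path) as f:
    hps = json.load(f)

# Assumed flag under the "model" section controlling the duration discriminator.
hps["model"]["use_duration_discriminator"] = False

with open(cfg_path, "w") as f:
    json.dump(hps, f, indent=2)
```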

Madhavan0123 commented 8 months ago

Yes, it seems to be working for now. Any idea why the duration discriminator is causing the issue?

p0p4k commented 8 months ago

I feel my implementation was too naive; it might need to be corrected with some testing. I'm busy with other models now and will do it when I have some time. Let me know about the audio quality after you train. Thanks.

CreepJoye commented 7 months ago

Hello, thank you for your great effort! I am running into the same problem and want to know whether you will fix it soon or are still busy with other models.

p0p4k commented 7 months ago

I have moved to improving pflowtts.

JohnHerry commented 6 months ago

> I have moved to improving pflowtts.

Hi, p0p4k, how is pflowtts coming along now? Is it a better choice than vits2? Can it support both normal TTS and zero-shot TTS?

p0p4k commented 6 months ago

I think it is better than vits/vits2. The only downside is that it is not end-to-end (e2e).

JohnHerry commented 6 months ago

> I think it is better than vits/vits2. The only downside is that it is not end-to-end (e2e).

ok, thank you.

codeghees commented 4 months ago

@p0p4k do you know the bug here?

p0p4k commented 4 months ago

@codeghees Which part? The training-stuck part?

codeghees commented 4 months ago

yep

codeghees commented 4 months ago

In the same boat.

p0p4k commented 4 months ago

@codeghees I did not look into this personally because I have no GPU yet. Maybe you can try to debug it and send a PR; I can assist you. Thanks a lot!

codeghees commented 4 months ago

Yep, will do! Trying to debug this.

codeghees commented 4 months ago

@p0p4k The hang is on the line scaler.scale(loss_gen_all).backward()

It seems like GradScaler has issues with multi-GPU. I removed it and replaced it with a standard backward pass, but the issue persists. It looks like a multi-GPU issue.
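
For reference, a minimal sketch of the swap described above, assuming the usual VITS-style training loop (the names loss_gen_all, optim_g, net_g, and the fp16_run flag follow that convention and are assumptions here, not the repo's exact code):

```python
import torch
from torch.cuda.amp import GradScaler

# Assumed setup from a VITS-style train loop; names are illustrative.
fp16_run = False                       # set True to use mixed precision
scaler = GradScaler(enabled=fp16_run)

def backward_and_step(loss_gen_all, optim_g, net_g):
    optim_g.zero_grad()
    if fp16_run:
        # AMP path: scale the loss before backward, then unscale and step.
        scaler.scale(loss_gen_all).backward()
        scaler.unscale_(optim_g)
        torch.nn.utils.clip_grad_norm_(net_g.parameters(), max_norm=1000.0)
        scaler.step(optim_g)
        scaler.update()
    else:
        # Plain fp32 path: the "standard backprop" tried in the comment above.
        loss_gen_all.backward()
        torch.nn.utils.clip_grad_norm_(net_g.parameters(), max_norm=1000.0)
        optim_g.step()
```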

p0p4k commented 4 months ago

Works on single gpu?

codeghees commented 4 months ago

Haven't tested yet. Trying a run with fp16 enabled.

farzanehnakhaee70 commented 2 months ago

@p0p4k I have no issues with single-GPU training, but it gets stuck if I do multi-GPU training. Any success in resolving the issue?