Closed: yigityu closed this issue 1 year ago
@yigityu hey, thanks for reporting
could you try turning this setting off? if that doesn't work, also turn off this one
Thanks, trying now.
@yigityu how high is your learning rate? after adding the local attentions, learning rate may need to be lowered to around 2e-4
I kept it at the default of 3e-4 until now. I'll do two trials: one to confirm that turning these options off works past 1000+ steps (running now), and one with these options on plus the lowered learning rate. Will report results.
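For reference, the learning rate is passed when constructing the trainer. A minimal sketch along the lines of the README example of the time; the exact keyword names (in particular `lr`) are assumptions and may differ between versions:

```python
from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8
)

trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/audio/files',   # hypothetical path
    lr = 2e-4,                         # lowered from 3e-4, as suggested above
    batch_size = 4,
    grad_accum_every = 8,
    data_max_length = 320 * 32,
    num_train_steps = 1_000_000
).cuda()

trainer.train()
```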
@yigityu thank you! :pray:
FWIW 0.7.1 doesn't appear to be training too well for me either. I'm 1200 steps in (~1 day) and the samples are just high pitched noise. With 0.5.1 at the same # of steps the samples were starting to resemble the source. I'm going to let 0.7.1 train a while longer and will try some of the suggestions here...
I'm afraid turning off these two options didn't change the result.
Tuning the lr down to 2e-4 just delayed the problem to 5000 steps. Here are the loss graphs for the tuned-down lr:
I will try rolling back further as well if I can find the change.
between 0.5.1 and 0.7, that only leaves this change as the culprit: https://github.com/lucidrains/audiolm-pytorch/commit/36c3954033caa535c397574c891da5711f056a37
I'll switch it back soon
thank you for these experiments and your patience
Meanwhile I rolled back to daeedb2 and it's still stable after 9000 steps.
> thank you for these experiments and your patience
my pleasure.
@yigityu ok, do you want to try 0.7.2 ?
@yigityu what were your training parameters? In 0.5.1, after 200,000 steps (~3 days), I still only had pitched noise; the last release whose samples started to resemble the signal (after 32,000 steps) was 0.2.3. But I'm beginning to think I'm not using the framework correctly, as it seems to work for others.
> @yigityu ok, do you want to try 0.7.2?
Sure, I will try out and let you know the results.
> @yigityu what were your training parameters?
I'm using the default params listed in the readme, but I haven't really achieved anything but noise yet. I'm just trying to stabilize training for at least a couple of days, but I'm really hopeful that we'll get it working.
@yigityu have a look at my trials: https://github.com/lucidrains/audiolm-pytorch/issues/57#issuecomment-1401986720
If you have the time, you could also try it with 0.2.3. I would really be interested to see if it works better for you too.
Both runs in progress:
- 0.7.2 with lr=3e-4: same issue at around 16,000 steps. Now trying a smaller learning rate from an earlier checkpoint.
- daeedb2 with lr=3e-4: losses didn't seem to stabilize, so I reduced the lr to 2e-4; now at 60k steps. Still noise, but it sounds somewhat less like garbage.
@BlackFox1197 I think we're probably getting similar results here. I will keep it running a little bit further, but might need to go back to 0.2.3 to see if it seems better.
yeah I think it is fine now, just keep training as long as nothing explodes. djqualia said he trained for 10 days on an A6000, for reference
Full disclosure, that 10-day training was at version 0.2.x or below... please see the other related thread where I think something has regressed since 0.2.3, which gave the best results I've had. I wish it were easy to test these different newer versions in parallel to pinpoint it, but it's rather time consuming. I'll continue to try to collect better A/B data and share it here....
@djqualia oh noes
how sure are you that it regressed? the only other offending diff would be https://github.com/lucidrains/audiolm-pytorch/commit/8259b0d03ce0fe49d9dcc49ae29e0ccbb704e7bc
maybe i should roll that back
the other thing is that anything prior to 0.3 had a bug in the discriminators, so i'm not sure how it could have learned in the past
@yigityu @BlackFox1197 @djqualia ok, decided to just bring back the complex stft discriminator in 0.7.3. could one of you run this and see if it is any better?
if it isn't, we'll just keep rolling back changes until it works again
i'll set aside some time to run experiments too, maybe this Sunday
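For context, the "complex STFT discriminator" operates on a complex-valued spectrogram of the waveform rather than on the raw samples. A rough, generic illustration of what that input looks like (not the repo's actual code):

```python
import torch

wave = torch.randn(1, 16000)                # (batch, samples); hypothetical 1s of 16 kHz audio

spec = torch.stft(
    wave,
    n_fft = 1024,
    hop_length = 256,
    win_length = 1024,
    window = torch.hann_window(1024),
    return_complex = True
)                                           # (batch, freq bins, frames), complex dtype

# split real/imaginary parts into two channels for a Conv2d-based discriminator
disc_input = torch.view_as_real(spec)       # (batch, freq bins, frames, 2)
disc_input = disc_input.permute(0, 3, 1, 2) # (batch, 2, freq bins, frames)
```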
Sure thing, I started a new training run with 0.7.3 just now, will report back tomorrow...
@djqualia thank you!
Just a quick check before I go to work (too soon to tell, I think), but 0.7.3 appears to be a bit better. I'm only on step 720 with my config, but I'm hearing something resembling the source sounds behind some high-pitched noise. Hopefully it keeps improving!
I'm also running a separate training on 0.7.2 (not apples to apples since it's separate hardware), to try and determine if the regression was fixed with that too (just in case)
@djqualia tremendous! keep us updated 🙏
I have 8000 steps now in 0.7.3 and it's way better than 0.7.1 or 0.5.1. I'm gonna try to hit 200,000-300,000 steps over the weekend, but it looks very good!
@BlackFox1197 🥳 🚀 hurray! time to democratize audio encoding 😄
however, this would also mean this issue is back to being unresolved 😢
i'll play around with these settings over the weekend
i got them from the way encodec approached their stft discriminator, but perhaps something wasn't done correctly. for now, we'll just stick with the single-machine complex stft discriminator
the encodec non-complex stft discriminator is also using weight normalization while I'm using gradient penalty
could be another reason why it isn't working
what would actually be best is if the pytorch team resolves their issue
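To make the contrast concrete, here is a rough, generic PyTorch sketch of the two regularization strategies being compared; it is not the code used in the repo:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# 1) weight normalization (the encodec-style choice): reparameterize a layer's
#    weight as magnitude * direction, applied once when the discriminator is built
conv = weight_norm(nn.Conv2d(2, 32, kernel_size = 3, padding = 1))

# 2) gradient penalty (the alternative mentioned above): penalize the norm of the
#    discriminator's gradient w.r.t. its input; the input must have requires_grad = True
def gradient_penalty(inputs, disc_logits, weight = 10.):
    gradients, = torch.autograd.grad(
        outputs = disc_logits,
        inputs = inputs,
        grad_outputs = torch.ones_like(disc_logits),
        create_graph = True
    )
    gradients = gradients.reshape(gradients.shape[0], -1)
    return weight * ((gradients.norm(2, dim = 1) - 1) ** 2).mean()
```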
yep, i agree and can also confirm that 0.7.3 is doing better!
my comparison to 0.7.2 is less conclusive/scientific, but 0.7.2 does seem worse to me. i wonder if that means the change from 0.7.1->0.7.2 was unnecessary?
@djqualia woohoo!
i'm not sure re: 0.7.1 -> 0.7.2. @yigityu seems to have thought it became more stable, but we will probably have to just try the encodec activation configuration on the complex stft discriminator (and compare it to the way it is now + complex stft discr)
I'm sort of in the "if it ain't broke don't try to fix it" stage 😂
ok unfortunately I have bad news.
It looked fine at first, but after 18,000 steps this is the error I get:
soundstream total loss: nan, soundstream recon loss: nan | discr (scale 1) loss: nan | discr (scale 0.5) loss: nan | discr (scale 0.25) loss: nan
It started exploding after 16,000 steps. In addition, after 12,000 steps it really degraded, and with the checkpoint at 18,000 steps it does not even produce a real output anymore:
tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]])
@BlackFox1197 ahh yeah, so that is called divergence. could you get your loss curves and share those?
Try lowering the learning rate, or removing attention in the next run
that's fine though, it is basically working early on, not that concerned
@BlackFox1197 the typical thing we do when adversarial training is involved is to save checkpoints and pick the one just before it explodes
It still shouldn't explode this early, but just so you know
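A generic way to follow that advice, independent of the trainer's own checkpointing, is to save numbered checkpoints at a fixed interval and stop as soon as the loss goes NaN, then pick the checkpoint just before divergence. Sketch only; `soundstream`, `optimizer`, and the interval are placeholders:

```python
import torch

SAVE_EVERY = 1000  # hypothetical interval, in training steps

def maybe_checkpoint(step, soundstream, optimizer, loss):
    # bail out as soon as the loss diverges, so an earlier checkpoint can be reloaded
    if torch.isnan(loss):
        raise RuntimeError(f'loss went NaN at step {step}; reload an earlier checkpoint')

    # keep numbered checkpoints so the one just before the explosion can be picked later
    if step % SAVE_EVERY == 0:
        torch.save({
            'step': step,
            'model': soundstream.state_dict(),
            'optim': optimizer.state_dict()
        }, f'./checkpoints/soundstream.{step}.pt')
```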
with the new MusicLM examples out (https://google-research.github.io/seanet/musiclm/examples/), i should probably just get my hands dirty and start training
basically same as what is in the repo
oh, they actually have a new audio CLIP model in here called MuLan... ok, gotta build that here too then (or in a separate repo)
Regardless, getting soundstream trained is a priority now
I'm at 25k steps with version 0.7.3 and it seems good so far - still a lot of noise, but training is stable.
Looks like new challenges are piling up @lucidrains, let's get this soundstream trained :)
@yigityu yesss lets do this! stable diffusion moment for audio (sounds, speech, music, whatever) here we come!
soundstream is the linchpin
@yigityu is your training run on 0.7.3 with all the settings turned on? (attention and learned EMA?)
Yes, those are all turned on.
@yigityu learning rate still at 2e-4?
@BlackFox1197 how big is your training set? are you doing LibriSpeech?
I'm keeping it at 3e-4, but was planning to reduce it depending on how it goes.
FWIW my training at 0.7.3 is still stable ~1.5 days in. I am using a learning rate of 3e-4, an effective batch size of 96, and a data_max_length of 16000 (or 10s), with everything else at its defaults in 0.7.3. In my experience, with lower batch sizes you might need to lower the learning rate...
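For anyone trying to reproduce a setup like this, the effective batch size is batch_size × grad_accum_every. A sketch under the same assumptions as the earlier one (parameter names follow the README of the time; the 12 × 8 split is just one way to reach 96):

```python
from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(codebook_size = 1024, rq_num_quantizers = 8)

trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/audio/files',  # hypothetical path
    lr = 3e-4,
    batch_size = 12,                  # 12 * 8 accumulation steps = effective batch size of 96
    grad_accum_every = 8,
    data_max_length = 16000,          # in samples, as reported above
    num_train_steps = 1_000_000
).cuda()

trainer.train()
```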
+1 to seizing the "stable diffusion moment" for audio, and to the importance of Soundstream in enabling all of that here... thanks @lucidrains for your work!
@djqualia no problem! happy training everyone!
I know this issue is closed - but I have some good news, so I wanted to give an update on my runs in case it helps anyone:
My run with daeedb2 with lr reduced to 2e-4 seems stable now. 155k steps and I can hear a human voice in the background :)
Hi,
First of all, thank you for this project and all the other open source projects you're doing. I'm a big fan of your work.
I was training with the latest version on the LibriSpeech dataset, and it looks like recon_loss shoots up and training goes nowhere afterwards. I didn't seem to have this with previous releases; I will roll back, try again, and report results here, but this might be a regression with the latest changes? I wanted to post it in case it helps anyone.