lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Soundstream loss doesn't decrease after 1167 steps - version 0.7.1 #61

Closed · yigityu closed this issue 1 year ago

yigityu commented 1 year ago

Hi,

First of all, thank you for this project and all the other open source projects you're doing. I'm a big fan of your work.

I was training with the latest version on the LibriSpeech dataset, and it looks like recon_loss shoots up and training goes nowhere afterwards. I didn't seem to have this with previous releases. I will roll back, try again, and report results here, but this might be a regression from the latest changes? I wanted to post it in case it helps anyone.

[screenshot: training loss curves]

lucidrains commented 1 year ago

@yigityu hey, thanks for reporting

could you try turning this setting off? if that doesn't work, also turn off this one
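For readers following along: the links above don't survive in this transcript, but later in the thread the settings in question are identified as the attention and learned EMA options. As a rough sketch only, such features would be toggled when constructing SoundStream; the flag names below are hypothetical, so check the SoundStream signature of your installed version:

```python
from audiolm_pytorch import SoundStream

# flag names below are hypothetical -- check the SoundStream signature in your
# installed version for the exact keyword arguments
soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8,
    use_local_attn = False,     # hypothetical: disable the recently added local attention
    use_learned_ema = False     # hypothetical: disable the learned EMA codebook update
)
```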

yigityu commented 1 year ago

Thanks, trying now.

lucidrains commented 1 year ago

@yigityu how high is your learning rate? after adding the local attentions, learning rate may need to be lowered to around 2e-4
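As a point of reference, a minimal sketch of where that learning rate would be set, assuming the readme-style SoundStreamTrainer and that it exposes an lr keyword (check your installed version; the folder path is a placeholder):

```python
from audiolm_pytorch import SoundStreamTrainer

trainer = SoundStreamTrainer(
    soundstream,                      # the SoundStream instance being trained
    folder = '/path/to/librispeech',  # placeholder path to the audio files
    lr = 2e-4,                        # lowered from the 3e-4 default, as suggested above
    batch_size = 4,
    grad_accum_every = 8,
    num_train_steps = 1_000_000
).cuda()

trainer.train()
```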

yigityu commented 1 year ago

I kept it at the default of 3e-4 until now. I'll do two trials: one to confirm that turning these options off works past 1000+ steps (running now), and one with these options on plus a lowered learning rate. Will report results.

lucidrains commented 1 year ago

@yigityu thank you! :pray:

djqualia commented 1 year ago

FWIW, 0.7.1 doesn't appear to be training too well for me either. I'm 1200 steps in (~1 day) and the samples are just high-pitched noise. With 0.5.1 at the same # of steps, the samples were starting to resemble the source. I'm going to let 0.7.1 train a while longer and will try some of the suggestions here...

yigityu commented 1 year ago

I'm afraid turning off these two options didn't change the result.

Tuning down the lr to 2e-4 just delayed the problem to 5000 steps. Here are the graphs for the tuned-down lr:

[screenshot: loss curves with lr lowered to 2e-4]

I will try rolling back further as well if I can find the change.

lucidrains commented 1 year ago

that only leaves this change https://github.com/lucidrains/audiolm-pytorch/commit/36c3954033caa535c397574c891da5711f056a37 as the culprit between 0.5.1 and 0.7; I'll switch it back soon

lucidrains commented 1 year ago

thank you for these experiments and your patience

yigityu commented 1 year ago

Meanwhile I rolled back to daeedb2 and it's still stable after 9000 steps.

yigityu commented 1 year ago

> thank you for these experiments and your patience

my pleasure.

lucidrains commented 1 year ago

@yigityu ok, do you want to try 0.7.2?

ckwdani commented 1 year ago

@yigityu what were your training parameters? In 0.5.1, after 200,000 steps (~3 days), I still only had pitched noise; the last release that started to resemble the signal (after 32,000 steps) was 0.2.3. But I'm beginning to think I'm not using the framework correctly, as it seems to work for others.

yigityu commented 1 year ago

> @yigityu ok, do you want to try 0.7.2?

Sure, I will try out and let you know the results.

> @yigityu what were your training parameters? In 0.5.1, after 200,000 steps (~3 days), I still only had pitched noise; the last release that started to resemble the signal (after 32,000 steps) was 0.2.3. But I'm beginning to think I'm not using the framework correctly, as it seems to work for others.

I'm using the default params listed in the readme, but I haven't really achieved anything but noise yet. I'm just trying to keep training stable for at least a couple of days, and I'm really hopeful that we'll get it working.

ckwdani commented 1 year ago

@yigityu have a look at my trials: https://github.com/lucidrains/audiolm-pytorch/issues/57#issuecomment-1401986720. If you have the time, you could also try it with 0.2.3. I would really be interested to see if that works better for you too.

yigityu commented 1 year ago

Both runs are in progress:

- Run with 0.7.2, lr=3e-4: same issue at around 16000 steps. Trying now with a smaller learning rate from an earlier checkpoint.
- Run with daeedb2, lr=3e-4: losses didn't seem to stabilize enough. Reduced the lr to 2e-4 and now at 60k steps. Still noise, but it sounds somewhat less like garbage.

@BlackFox1197 I think we're probably getting similar results here. I will keep it running a little bit further, but might need to go back to 0.2.3 to see if it seems better.

lucidrains commented 1 year ago

yeah I think it is fine now, just keep training as long as nothing explodes. djqualia said he trained for 10 days on an A6000, for reference

djqualia commented 1 year ago

Full disclosure, that 10-day training was on version 0.2.x or below... please see the other related thread, where I think something has regressed since 0.2.3, which gave the best results I've had. I wish it were easy to test these different newer versions in parallel to pinpoint the issue, but it's rather time consuming. I'll continue to try to collect better A/B data and share it here...

lucidrains commented 1 year ago

@djqualia oh noes

how sure are you that it regressed? the only other offending diff would be https://github.com/lucidrains/audiolm-pytorch/commit/8259b0d03ce0fe49d9dcc49ae29e0ccbb704e7bc; maybe i should roll that back

lucidrains commented 1 year ago

the other thing is that anything prior to 0.3 had a bug in the discriminators, so i'm not sure how it could have learned in the past

lucidrains commented 1 year ago

@yigityu @BlackFox1197 @djqualia ok, decided to just bring back the complex stft discriminator in 0.7.3. could one of you run this and see if it is any better?

if it isn't, we'll just keep rolling back changes until it works again

lucidrains commented 1 year ago

i'll set aside some time to run experiments too, maybe this Sunday

djqualia commented 1 year ago

Sure thing, I started a new training run with 0.7.3 just now, will report back tomorrow...

lucidrains commented 1 year ago

@djqualia thank you!

djqualia commented 1 year ago

Just a quick check before I go to work (too soon to tell, I think), but 0.7.3 appears to be a bit better. I'm only on step 720 with my config, but I'm hearing something resembling the source sounds behind some high-pitched noise. Hopefully it keeps improving!

I'm also running a separate training run on 0.7.2 (not apples to apples, since it's on separate hardware), to try to determine whether the regression was fixed there too (just in case)

lucidrains commented 1 year ago

@djqualia tremendous! keep us updated 🙏

ckwdani commented 1 year ago

I have 8000 steps now in 0.7.3 and it's way better than 0.7.1 or 0.5.1. [screenshot: 7_3_8000] I'm gonna try to hit 200,000-300,000 steps over the weekend, but it looks very good!

lucidrains commented 1 year ago

@BlackFox1197 🥳 🚀 hurray! time to democratize audio encoding 😄

however, this would also mean this issue is back to being unresolved 😢

lucidrains commented 1 year ago

i'll play around with these settings over the weekend

i got them from the way encodec approached their stft discriminator, but perhaps something wasn't done correctly. for now, we'll just stick with single machine complex stft discriminator
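For context on the terminology: a complex STFT discriminator takes the complex-valued spectrogram (real and imaginary parts) as input rather than just the magnitude. A minimal, generic sketch in plain PyTorch of preparing such an input; this is illustrative only, not the repo's actual discriminator code:

```python
import torch

wave = torch.randn(1, 24_000)  # one second of fake audio at 24 kHz

# complex STFT: returns a (batch, freq, frames) complex tensor
stft = torch.stft(
    wave,
    n_fft = 1024,
    hop_length = 256,
    window = torch.hann_window(1024),
    return_complex = True
)

# view real/imaginary parts as 2 channels so an ordinary Conv2d stack
# (the discriminator) can consume it: (batch, 2, freq, frames)
discr_input = torch.view_as_real(stft).permute(0, 3, 1, 2)
print(discr_input.shape)  # torch.Size([1, 2, 513, 94])
```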

lucidrains commented 1 year ago

the encodec non-complex stft discriminator is also using weight normalization while I'm using gradient penalty

could be another reason why it isn't working
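For reference, the two regularization styles being contrasted here: weight normalization wraps the discriminator's layers directly, while a gradient penalty adds a loss term on the gradient of the discriminator output with respect to its input. A generic sketch (not the repo's exact implementation):

```python
import torch
from torch import nn
from torch.nn.utils import weight_norm

# style 1: a weight-normalized conv layer, in the spirit of the encodec discriminator
conv = weight_norm(nn.Conv1d(1, 32, kernel_size = 7, padding = 3))

# style 2: a WGAN-GP style gradient penalty on the discriminator output
# w.r.t. its input waveform (the input must have requires_grad = True)
def gradient_penalty(wave, discr_output, weight = 10.):
    grad = torch.autograd.grad(
        outputs = discr_output,
        inputs = wave,
        grad_outputs = torch.ones_like(discr_output),
        create_graph = True
    )[0]
    grad = grad.reshape(grad.shape[0], -1)
    return weight * ((grad.norm(2, dim = -1) - 1) ** 2).mean()

# usage sketch
wave = torch.randn(4, 1, 16000, requires_grad = True)
logits = conv(wave).mean()           # stand-in for a full discriminator forward pass
gp = gradient_penalty(wave, logits)
```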

lucidrains commented 1 year ago

what would actually be best is if the pytorch team resolves their issue

djqualia commented 1 year ago

yep, i agree and can also confirm that 0.7.3 is doing better!

my comparison to 0.7.2 is less conclusive/scientific, but 0.7.2 does seem worse to me. i wonder if that means the change from 0.7.1->0.7.2 was unnecessary?

lucidrains commented 1 year ago

@djqualia woohoo!

i'm not sure re: 0.7.1 -> 0.7.2; @yigityu seems to have thought it became more stable, but we will probably have to just try the encodec activation configuration on the complex stft discriminator (and compare to the way it is now + complex stft discr)

lucidrains commented 1 year ago

I'm sort of in the "if it ain't broke don't try to fix it" stage 😂

ckwdani commented 1 year ago

ok, unfortunately I have bad news. It looked fine at first, but after 18,000 steps this is the error readout I get:

soundstream total loss: nan, soundstream recon loss: nan | discr (scale 1) loss: nan | discr (scale 0.5) loss: nan | discr (scale 0.25) loss: nan

and it started exploding after 16,000 steps. In addition, after 12,000 steps it really degraded. [screenshot: 7_3_12000_bad] And with the checkpoint at 18,000 steps, it does not even produce a real output anymore:

tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]])

lucidrains commented 1 year ago

@BlackFox1197 ahh yeah, so that is called divergence. could you get your loss curves and share those?

Try lowering the learning rate, or removing attention in the next run

lucidrains commented 1 year ago

that's fine though; it is basically working early on, so I'm not that concerned

lucidrains commented 1 year ago

@BlackFox1197 the typical thing we do when adversarial training is involved is to save checkpoints and pick the one just before it explodes

It still shouldn't explode this early, but just so you know
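A minimal sketch of that checkpoint-and-rollback workflow using plain torch.save/torch.load (file names and the save interval are illustrative; the trainer may already save checkpoints for you on a schedule):

```python
import torch

# illustrative only: checkpoint periodically and stop when the loss goes NaN,
# so the last pre-divergence checkpoint can be picked for inference / resuming
def checkpoint_and_guard(soundstream, loss, step, every = 1000):
    if torch.isnan(loss).any():
        raise RuntimeError(f'training diverged at step {step}; fall back to the latest checkpoint')
    if step % every == 0:
        torch.save(soundstream.state_dict(), f'soundstream.{step}.pt')

# later, to recover the weights saved just before the explosion (e.g. step 16000):
# soundstream.load_state_dict(torch.load('soundstream.16000.pt'))
```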

lucidrains commented 1 year ago

with the new https://google-research.github.io/seanet/musiclm/examples/, I should probably just get my hands dirty and start training

it is basically the same as what is in the repo

lucidrains commented 1 year ago

oh, they actually have a new audio CLIP model in here called MuLan... ok, gotta build that here then too (or in a separate repo)

Regardless, getting soundstream trained is a priority now

yigityu commented 1 year ago

I'm at 25k steps with version 0.7.3 and it seems good so far: still a lot of noise, but stable training.

Looks like new challenges are piling up @lucidrains, let's get this soundstream trained :)

lucidrains commented 1 year ago

@yigityu yesss lets do this! stable diffusion moment for audio (sounds, speech, music, whatever) here we come!

lucidrains commented 1 year ago

soundstream is the linchpin

lucidrains commented 1 year ago

@yigityu is your training run on 0.7.3 with all the settings turned on? (attention and learned EMA?)

yigityu commented 1 year ago

Yes, those are all turned on.

lucidrains commented 1 year ago

@yigityu learning rate still at 2e-4? @BlackFox1197 how big is your training set? are you doing LibriSpeech?

yigityu commented 1 year ago

I'm keeping it at 3e-4, but was planning to reduce it depending on how it goes.

djqualia commented 1 year ago

FWIW my training on 0.7.3 is still stable ~1.5 days in. I am using a learning rate of 3e-4, an effective batch size of 96, and data_max_length of 16000 or 10s, with everything else at its defaults in 0.7.3. In my experience, with lower batch sizes you might need to lower the learning rate...
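For anyone reproducing this, a short sketch of how those numbers might decompose under the readme-style trainer arguments; the split of 96 into a per-step batch and an accumulation factor below is a guess, since only the product is given:

```python
# hypothetical decomposition of the reported settings above
lr = 3e-4
batch_size = 12                                        # per-step batch (guess)
grad_accum_every = 8                                   # accumulation factor (guess)
effective_batch_size = batch_size * grad_accum_every   # 96, as reported
data_max_length = 16000                                # as reported (in samples)
```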

+1 to seizing the "stable diffusion moment" for audio, and to the importance of Soundstream in enabling all of that here... thanks @lucidrains for your work!

lucidrains commented 1 year ago

@djqualia no problem! happy training everyone!

yigityu commented 1 year ago

I know this issue is closed, but I have some good news, so I wanted to give an update on my runs in case it helps anyone:

My run with daeedb2 with lr reduced to 2e-4 seems stable now. 155k steps and I can hear a human voice in the background :)