lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Soundstream Training Goes From Great to Horrible #221

Open adamfils opened 11 months ago

adamfils commented 11 months ago

I have been training SoundStream for the past 3 days on my A6000. At 25,000 steps I got amazing results, but after that the loss increased abruptly and the later generations are just bad. As you can see below, from step 25031 the loss looks weird and blows up.

At 25,000 steps here is the result https://voca.ro/1c10gpytA3id

At 25,500 here is the result https://voca.ro/1eaoQiOmo1Se

25000: saving to results
25000: saving model to results
25001: soundstream total loss: 4.872, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.284 | discr (scale 0.5) loss: 1.894 | discr (scale 0.25) loss: 1.829
25002: soundstream total loss: 4.893, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.336 | discr (scale 0.5) loss: 1.899 | discr (scale 0.25) loss: 1.887
25003: soundstream total loss: 4.375, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.332 | discr (scale 0.5) loss: 1.825 | discr (scale 0.25) loss: 1.871
25004: soundstream total loss: 4.699, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.229 | discr (scale 0.5) loss: 1.879 | discr (scale 0.25) loss: 1.921
25005: soundstream total loss: 4.486, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.217 | discr (scale 0.5) loss: 1.859 | discr (scale 0.25) loss: 1.928
25006: soundstream total loss: 4.232, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.296 | discr (scale 0.5) loss: 1.842 | discr (scale 0.25) loss: 1.934
25007: soundstream total loss: 4.356, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.056 | discr (scale 0.5) loss: 1.939 | discr (scale 0.25) loss: 1.930
25008: soundstream total loss: 4.532, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.011 | discr (scale 0.5) loss: 1.965 | discr (scale 0.25) loss: 1.964
25009: soundstream total loss: 4.534, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.065 | discr (scale 0.5) loss: 2.011 | discr (scale 0.25) loss: 2.013
25010: soundstream total loss: 4.773, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.297 | discr (scale 0.5) loss: 2.198 | discr (scale 0.25) loss: 2.055
25011: soundstream total loss: 4.817, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.109 | discr (scale 0.5) loss: 2.110 | discr (scale 0.25) loss: 2.033
25012: soundstream total loss: 5.056, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.398 | discr (scale 0.5) loss: 2.042 | discr (scale 0.25) loss: 1.931
25013: soundstream total loss: 5.122, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.212 | discr (scale 0.5) loss: 1.955 | discr (scale 0.25) loss: 1.865
25014: soundstream total loss: 4.553, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.231 | discr (scale 0.5) loss: 1.909 | discr (scale 0.25) loss: 1.913
25015: soundstream total loss: 4.360, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.232 | discr (scale 0.5) loss: 1.847 | discr (scale 0.25) loss: 1.952
25016: soundstream total loss: 4.644, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.279 | discr (scale 0.5) loss: 1.803 | discr (scale 0.25) loss: 1.994
25017: soundstream total loss: 5.561, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.278 | discr (scale 0.5) loss: 1.807 | discr (scale 0.25) loss: 1.943
25018: soundstream total loss: 4.956, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.209 | discr (scale 0.5) loss: 1.713 | discr (scale 0.25) loss: 1.878
25019: soundstream total loss: 5.055, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.179 | discr (scale 0.5) loss: 1.732 | discr (scale 0.25) loss: 1.865
25020: soundstream total loss: 5.168, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.332 | discr (scale 0.5) loss: 1.762 | discr (scale 0.25) loss: 1.853
25021: soundstream total loss: 4.924, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.375 | discr (scale 0.5) loss: 1.813 | discr (scale 0.25) loss: 1.867
25022: soundstream total loss: 4.844, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.462 | discr (scale 0.5) loss: 1.786 | discr (scale 0.25) loss: 1.855
25023: soundstream total loss: 5.200, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.579 | discr (scale 0.5) loss: 1.798 | discr (scale 0.25) loss: 1.822
25024: soundstream total loss: 7.380, soundstream recon loss: 0.002 | discr (scale 1) loss: 2.756 | discr (scale 0.5) loss: 1.805 | discr (scale 0.25) loss: 1.813
25025: soundstream total loss: 4.865, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.723 | discr (scale 0.5) loss: 1.748 | discr (scale 0.25) loss: 1.758
25026: soundstream total loss: 4.889, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.725 | discr (scale 0.5) loss: 1.854 | discr (scale 0.25) loss: 1.846
25027: soundstream total loss: 5.056, soundstream recon loss: 0.001 | discr (scale 1) loss: 2.747 | discr (scale 0.5) loss: 1.817 | discr (scale 0.25) loss: 1.854
25028: soundstream total loss: 5.091, soundstream recon loss: 0.001 | discr (scale 1) loss: 3.242 | discr (scale 0.5) loss: 1.839 | discr (scale 0.25) loss: 1.891
25029: soundstream total loss: 4.385, soundstream recon loss: 0.001 | discr (scale 1) loss: 8.894 | discr (scale 0.5) loss: 1.760 | discr (scale 0.25) loss: 1.883
25030: soundstream total loss: 2.860, soundstream recon loss: 0.001 | discr (scale 1) loss: 108.547 | discr (scale 0.5) loss: 1.708 | discr (scale 0.25) loss: 1.798
25031: soundstream total loss: -15.905, soundstream recon loss: 0.002 | discr (scale 1) loss: 1718.587 | discr (scale 0.5) loss: 1.557 | discr (scale 0.25) loss: 1.979
25032: soundstream total loss: -303.631, soundstream recon loss: 0.024 | discr (scale 1) loss: 10940.722 | discr (scale 0.5) loss: 1.072 | discr (scale 0.25) loss: 3.398
25033: soundstream total loss: -2264.270, soundstream recon loss: 0.295 | discr (scale 1) loss: 234567.777 | discr (scale 0.5) loss: 0.180 | discr (scale 0.25) loss: 5.426
25034: soundstream total loss: -53273.180, soundstream recon loss: 15.740 | discr (scale 1) loss: 1108289.203 | discr (scale 0.5) loss: 0.008 | discr (scale 0.25) loss: 0.970
25035: soundstream total loss: -244286.930, soundstream recon loss: 272.947 | discr (scale 1) loss: 3089418.844 | discr (scale 0.5) loss: 0.010 | discr (scale 0.25) loss: 0.029
25036: soundstream total loss: -648283.398, soundstream recon loss: 2447.980 | discr (scale 1) loss: 7947847.062 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.007
25037: soundstream total loss: -1452483.922, soundstream recon loss: 19413.394 | discr (scale 1) loss: 18546006.250 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.001
25038: soundstream total loss: -2364417.562, soundstream recon loss: 132011.410 | discr (scale 1) loss: 33489656.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.008
25039: soundstream total loss: 2783657.594, soundstream recon loss: 803092.328 | discr (scale 1) loss: 49849376.000 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.002
25040: soundstream total loss: 14825873.875, soundstream recon loss: 2074252.219 | discr (scale 1) loss: 38289075.500 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.002
25041: soundstream total loss: 11907697.250, soundstream recon loss: 1596693.234 | discr (scale 1) loss: 12477728.375 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.016
25042: soundstream total loss: 2389267.781, soundstream recon loss: 358384.961 | discr (scale 1) loss: 1455136.312 | discr (scale 0.5) loss: 0.000 | discr (scale 0.25) loss: 0.009
25043: soundstream total loss: 47939.899, soundstream recon loss: 14739.114 | discr (scale 1) loss: 37.778 | discr (scale 0.5) loss: 63.086 | discr (scale 0.25) loss: 52.932
25044: soundstream total loss: 847.260, soundstream recon loss: 2.112 | discr (scale 1) loss: 15.008 | discr (scale 0.5) loss: 60.966 | discr (scale 0.25) loss: 115.134
25045: soundstream total loss: 936.149, soundstream recon loss: 0.910 | discr (scale 1) loss: 16.900 | discr (scale 0.5) loss: 5.909 | discr (scale 0.25) loss: 0.893
25046: soundstream total loss: 401.222, soundstream recon loss: 0.256 | discr (scale 1) loss: 18.919 | discr (scale 0.5) loss: 6.674 | discr (scale 0.25) loss: 0.226
25047: soundstream total loss: 172.702, soundstream recon loss: 0.054 | discr (scale 1) loss: 18.073 | discr (scale 0.5) loss: 4.793 | discr (scale 0.25) loss: 0.570
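A log like the one above can be scanned automatically for the first step where a discriminator loss jumps away from its recent level. The following is a minimal sketch, not part of the library: the helper names `parse_line` and `first_divergent_step`, the window size, and the threshold factor are all hypothetical choices, and the regular expression assumes exactly the line format shown in this post.

```python
import re

# Matches one training log line in the format shown in this thread (assumed).
LINE_RE = re.compile(
    r"(?P<step>\d+): soundstream total loss: (?P<total>-?[\d.]+), "
    r"soundstream recon loss: (?P<recon>-?[\d.]+)"
    r"(?P<discr>(?: \| discr \(scale [\d.]+\) loss: -?[\d.]+)+)"
)
DISCR_RE = re.compile(r"discr \(scale ([\d.]+)\) loss: (-?[\d.]+)")


def parse_line(line):
    """Parse one log line into a dict, or return None for non-loss lines."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    discr = {scale: float(v) for scale, v in DISCR_RE.findall(m["discr"])}
    return {
        "step": int(m["step"]),
        "total": float(m["total"]),
        "recon": float(m["recon"]),
        "discr": discr,
    }


def first_divergent_step(lines, window=10, factor=3.0):
    """Return the first step where any discriminator loss exceeds
    `factor` times the mean of its previous `window` values, else None."""
    history = {}
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        for scale, value in rec["discr"].items():
            past = history.setdefault(scale, [])
            if len(past) >= window:
                baseline = sum(past[-window:]) / window
                if value > factor * baseline:
                    return rec["step"]
            past.append(value)
    return None
```

Feeding the log lines to `first_divergent_step` gives a step number to compare against the saved checkpoints. In the log above, the scale 1 discriminator loss goes from 3.242 at step 25028 to 8.894 at step 25029, two steps before the total loss turns negative.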

adamfils commented 11 months ago

My training code

`from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size=1024,
    rq_num_quantizers=8,
    rq_groups=2,           # 2 groups of quantizers
    attn_window_size=128,  # local attention receptive field at bottleneck
    attn_depth=2           # 2 local attention transformer blocks - the soundstream folks were not experts with attention, so i took the liberty to add some. encodec went with lstms, but attention should be better
)

trainer = SoundStreamTrainer(
    soundstream,
    folder='/home/user/Downloads/LibriSpeech',
    batch_size=8,
    grad_accum_every=8,  # effective batch size of 64 (8 x 8)
    # data_max_length=320 * 32,
    # lr=2e-6,
    data_max_length_seconds=3,
    save_model_every=1000,
    save_results_every=500,
    num_train_steps=10000001
).cuda()

trainer.train()`

lucidrains commented 11 months ago

@adamfils try loading from the checkpoint just before the collapse, and lowering the learning rate
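As a rough illustration of this suggestion, the sketch below picks the latest checkpoint saved before a known collapse step. It is a minimal sketch under stated assumptions: the helper `last_checkpoint_before` is hypothetical, and the file naming `soundstream.<step>.pt` inside the results folder is assumed rather than confirmed in this thread.

```python
import re
from pathlib import Path

# Assumed checkpoint naming: "soundstream.<step>.pt" (not confirmed in this thread).
CKPT_RE = re.compile(r"soundstream\.(\d+)\.pt$")


def last_checkpoint_before(paths, collapse_step):
    """Return (step, path) of the latest checkpoint saved strictly before
    `collapse_step`, or None if no such checkpoint exists."""
    best = None
    for p in paths:
        m = CKPT_RE.search(Path(p).name)
        if m is None:
            continue  # skip audio samples and other files
        step = int(m.group(1))
        if step < collapse_step and (best is None or step > best[0]):
            best = (step, Path(p))
    return best


# Hypothetical usage, assuming the trainer accepts an `lr` keyword (as the
# commented-out `lr=2e-6` in the training code suggests) and has a `load` method:
#
#   found = last_checkpoint_before(Path("results").glob("soundstream.*.pt"), 25029)
#   trainer = SoundStreamTrainer(soundstream, ..., lr=<a lower learning rate>)
#   trainer.load(str(found[1]))
#   trainer.train()
```

With `save_model_every=1000` and a collapse starting shortly after step 25000, the checkpoint saved at step 25000 would be the one selected.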

adamfils commented 11 months ago

Thanks. Also, what is the difference between sample_31500.flac and sample_31500.ema.flac (the non-EMA and EMA audio samples)? Which should I use to measure the performance of the soundstream model? @lucidrains

lucidrains commented 11 months ago

@adamfils you want to use the ema version, which stands for exponential moving average

this is a common practice in generative modeling, where you keep a copy of your generator whose parameters are an exponentially smoothed average of the trained parameters, which often leads to better end models
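The update rule behind an exponential moving average of parameters can be sketched in a few lines. This is a toy illustration in plain Python, not the library's implementation: the function name `ema_update` and the decay value `beta` are assumptions chosen for readability.

```python
def ema_update(ema_params, online_params, beta=0.995):
    """One EMA step, elementwise: ema <- beta * ema + (1 - beta) * online."""
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema_params, online_params)]


# Toy "training run": a single online weight oscillates between 0.0 and 2.0.
online_trajectory = [[0.0], [2.0], [0.0], [2.0], [0.0], [2.0]]

# The EMA copy starts equal to the online weights and is updated after each step.
ema = list(online_trajectory[0])
for online in online_trajectory[1:]:
    ema = ema_update(ema, online, beta=0.5)  # small beta so the effect is visible
```

The smoothed copy moves less per step than the online weight does. With a decay close to 1 the EMA copy also lags the online weights, so the two sets of samples can differ noticeably after the online weights change quickly.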

adamfils commented 11 months ago

Okay. I ask because the EMA samples sound bad while the non-EMA audio samples sound great. 😬 I'm at 38,000 steps and have been training for about 6 days now. What tweaks would you suggest? @lucidrains

lucidrains commented 11 months ago

@adamfils yikes, that doesn't sound good! let me check on this maybe this sunday morning

Fritskee commented 10 months ago

Any updates on how this got fixed? I want to start a training run as well in the coming week.

lucidrains commented 10 months ago

multiple engineers and researchers have already successfully trained

you should just go for it, if you have enough data

lucidrains commented 10 months ago

@Fritskee my next stretch goal is to turn the soundstream training into a CLI, like what i did for lightweight gan

Fritskee commented 10 months ago

> multiple engineers and researchers have already successfully trained
>
> you should just go for it, if you have enough data

I just wanted to go with LibriSpeech, so I figured if the weights were already out there, might as well ask. But you make a fair point!

Fritskee commented 10 months ago

> @Fritskee my next stretch goal is to turn the soundstream training into a CLI, like what i did for lightweight gan

That'd be dope!

I also want to take the time to thank you for all your efforts to democratize the latest research in ML!