lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

Questions about training SoundStream: poor intelligibility and gradient explosion after 10k steps (sr=16k, B=96) #204

Open Makiyuyuko opened 1 year ago

Makiyuyuko commented 1 year ago

Very nice repo! Thank you authors for your contribution.

And here is my situation: I have been trying to train SoundStream from scratch with this repo (version 1.2.7) on about 20,000 hours of open-source speech data. I made essentially no changes to the repo except for setting the batch size as follows:

    trainer = SoundStreamTrainer(
        soundstream,
        audio_path_list=audio_path_list,
        batch_size=12,
        grad_accum_every=8,  # effective batch size of 12*8==96
        data_max_length_seconds=2,  # train on 2 second audio
        num_train_steps=1_000_000,
    ).cuda()
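For reference, here is what those settings amount to per optimizer step, assuming 16 kHz audio as in the title (just arithmetic, not repo code):

    # assuming 16 kHz audio and the trainer settings above
    sample_rate = 16_000
    crop_seconds = 2
    batch_size = 12
    grad_accum_every = 8

    samples_per_crop = sample_rate * crop_seconds    # 32_000 samples per 2-second training crop
    effective_batch = batch_size * grad_accum_every  # 96 crops per optimizer step (per process)
    print(samples_per_crop, effective_batch)         # 32000 96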

I have been running this on 4xA100 GPUs for a couple of days, and after it passed 10k steps I obtained the audio below. There are some signs of speech forming, but the noise is heavy. The total loss has stayed around ~20, gradually decreasing toward ~10. Based on my experience training vocoders such as HiFi-GAN/WaveGAN, I suspect the number of training steps is simply not enough yet and the high-frequency information has not been learned. However, I am a newbie in large-model training, so I'm not confident I'm on the right track. Do I just need more training steps, or has something gone wrong?

If anyone has met with/solved a similar problem, please share some information.

8k steps: [image]

9k steps: [image]

And the gradients just went out of control after 10,500 steps. I think training has definitely failed, but I don't know the reason. [image]

10.5k steps: [image]
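For debugging, here is a minimal, generic PyTorch sketch (with a placeholder model and loss, not SoundStream or the repo's trainer) of how the global gradient norm could be logged and clipped to catch this kind of blow-up earlier:

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 128)                                # placeholder for the real model
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

    for step in range(100):
        x = torch.randn(8, 128)
        loss = model(x).pow(2).mean()                          # placeholder for the total loss

        optimizer.zero_grad()
        loss.backward()

        # clip_grad_norm_ returns the pre-clipping total norm, which is handy for logging
        grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
        if grad_norm > 10.0:
            print(f"step {step}: grad norm spiked to {grad_norm:.2f}")

        optimizer.step()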

Makiyuyuko commented 1 year ago

Additional information:

    soundstream = SoundStream(
        codebook_size=1024,
        rq_num_quantizers=8,
        rq_groups=2,  # this paper proposes using multi-headed residual vector quantization - https://arxiv.org/abs/2305.02765
        attn_window_size=128,  # local attention receptive field at bottleneck
        attn_depth=2
    )

    lr = 2e-4

I didn't change these; should I?
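If SoundStreamTrainer in this version accepts `lr` and `max_grad_norm` keyword arguments (I still need to check the signature in trainer.py), one thing I am considering is a lower learning rate plus gradient clipping at the trainer level, roughly:

    trainer = SoundStreamTrainer(
        soundstream,
        audio_path_list=audio_path_list,   # same data arguments as in the first comment
        batch_size=12,
        grad_accum_every=8,
        data_max_length_seconds=2,
        num_train_steps=1_000_000,
        lr=1e-4,                           # assumed kwarg: half of the 2e-4 default above
        max_grad_norm=0.5,                 # assumed kwarg: clip the gradient norm each step
    ).cuda()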