lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

Questions about training SoundStream: poor intelligibility and gradient explosion after 10k steps (sr=16k, B=96) #204

Open Makiyuyuko opened 1 year ago

Makiyuyuko commented 1 year ago

Very nice repo! Thank you authors for your contribution.

And here is my situation: I have been trying to train SoundStream from scratch with this repo (version 1.2.7) on about 20,000 hours of open-source speech data. I made essentially no changes to the repo except for setting the batch size as follows:

    trainer = SoundStreamTrainer(
        soundstream,
        audio_path_list=audio_path_list,
        batch_size=12,
        grad_accum_every=8,  # effective batch size of 12*8==96
        data_max_length_seconds=2,  # train on 2 second audio
        num_train_steps=1_000_000,
    ).cuda()
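For reference, here is what those settings amount to per optimizer step, assuming 16 kHz audio as in the title (just arithmetic, not repo code):

    # assuming 16 kHz audio and the trainer settings above
    sample_rate = 16_000
    crop_seconds = 2
    batch_size = 12
    grad_accum_every = 8

    samples_per_crop = sample_rate * crop_seconds    # 32_000 samples per 2-second training crop
    effective_batch = batch_size * grad_accum_every  # 96 crops per optimizer step (per process)
    print(samples_per_crop, effective_batch)         # 32000 96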

I have been running this on 4xA100 GPUs for a couple of days, and after it passed 10k steps I obtained the audio below. There are some signs of speech forming, but the noise is heavy. The total loss has stayed around ~20, gradually decreasing toward ~10. Based on my experience training vocoders such as HiFi-GAN/WaveGAN, I suspect the number of training steps is simply not enough yet and the high-frequency information has not been learned. However, I am a newbie in large-model training, so I'm not confident I'm on the right track. Do I just need more training steps, or has something gone wrong?

If anyone has met with/solved a similar problem, please share some information.

8k steps: [image]

9k steps: [image]

And the gradients just went out of control after 10,500 steps. I think training has definitely failed, but I don't know the reason. [image]

10.5k steps: [image]
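For debugging, here is a minimal, generic PyTorch sketch (with a placeholder model and loss, not SoundStream or the repo's trainer) of how the global gradient norm could be logged and clipped to catch this kind of blow-up earlier:

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 128)                                # placeholder for the real model
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

    for step in range(100):
        x = torch.randn(8, 128)
        loss = model(x).pow(2).mean()                          # placeholder for the total loss

        optimizer.zero_grad()
        loss.backward()

        # clip_grad_norm_ returns the pre-clipping total norm, which is handy for logging
        grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
        if grad_norm > 10.0:
            print(f"step {step}: grad norm spiked to {grad_norm:.2f}")

        optimizer.step()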

Makiyuyuko commented 1 year ago

Additional information:

    soundstream = SoundStream(
        codebook_size=1024,
        rq_num_quantizers=8,
        rq_groups=2,  # this paper proposes using multi-headed residual vector quantization - https://arxiv.org/abs/2305.02765
        attn_window_size=128,  # local attention receptive field at bottleneck
        attn_depth=2
    )

    lr = 2e-4

I didn't change these; should I?
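If SoundStreamTrainer in this version accepts `lr` and `max_grad_norm` keyword arguments (I still need to check the signature in trainer.py), one thing I am considering is a lower learning rate plus gradient clipping at the trainer level, roughly:

    trainer = SoundStreamTrainer(
        soundstream,
        audio_path_list=audio_path_list,   # same data arguments as in the first comment
        batch_size=12,
        grad_accum_every=8,
        data_max_length_seconds=2,
        num_train_steps=1_000_000,
        lr=1e-4,                           # assumed kwarg: half of the 2e-4 default above
        max_grad_norm=0.5,                 # assumed kwarg: clip the gradient norm each step
    ).cuda()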