lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

typical range of `num_train_steps`? #80

Open naotokui opened 1 year ago

naotokui commented 1 year ago

Thanks for sharing this great repo!

I'm wondering what the typical range of num_train_steps is for a SoundStream model and the others. I tested with 10000 and saw the loss go down fairly smoothly, but it did not generate any meaningful results (very noisy).

djqualia commented 1 year ago

what version are you using? what is your batch size? i suspect you'll need to train more than 10k steps, however, there might be a problem with the recent versions...

@lucidrains i'm still testing and was going to wait until i had more data before posting, but i'm worried the complex stft discriminator might need to come back. i've moved onto a larger training set (so not quite the same as before), and i was 2 days into training on 0.10.3 and was still getting mostly noise, so i'm now testing at 0.7.8 (the last version before you removed it) with use_complex_stft_discriminator = True. i'm only half a day in, but the loss curves are looking better (trending down vs. up) and samples seem less noisy (but more time is needed to tell for sure). i'll report back in time...

lucidrains commented 1 year ago

> i'm worried the complex stft discriminator might need to come back [...] i'm now testing at 0.7.8 (the last version before you removed it) with use_complex_stft_discriminator = True [...] the loss curves are looking better (trending down vs. up) and samples seem less noisy

oh bummer, i really need help on this one :cry:

ok, i've reverted it for now

naotokui commented 1 year ago

@djqualia @lucidrains thanks for your comments! I tried with batch_size = 4 and 8. I'll try again with the reverted version and see how it goes!

djqualia commented 1 year ago

Update: I tested with 0.11.1 and can confirm the samples are MUCH better (I can already hear the source in some samples, <1 day into training). The total loss was much higher than in previous versions, but is generally trending down.

lucidrains commented 1 year ago

@djqualia that's great! thank you! btw I like your name 🤣

turian commented 1 year ago

@djqualia definitely has the best name

LWprogramming commented 1 year ago

Was able to replicate @djqualia's result with 0.11.1 :)

Here's a rough sketch of the total loss over time. (Unfortunately, I had some issues with my print/logging statements, so this doesn't cover the full training run, but you get the general idea: 20k training steps, batch_size=4, grad_accum_every=8, OpenSLR dev-clean dataset.)

(loss curve image)

BTW, the end result generally contains things that I recognize as sounding like human speech, but as if it were played back at 0.25x speed (sort of like a robot):

https://user-images.githubusercontent.com/13173037/218032553-f3775870-037e-4271-aca7-c9c5e14787e4.mp4

(I hear "rarely succeed")

Is this to be expected and hopefully would get better with more training, or is it more likely I did something wrong with running SoundStream e2e?

lucidrains commented 1 year ago

my first thought https://www.youtube.com/watch?v=uc6f_2nPSX8 🤖🎵

djqualia commented 1 year ago

I think the "robot sound" is probably expected this early on in training. It's akin to bit-reduction effects in music production (as soundstream builds up the RVQ network). My early samples trained on music have a similar effect, which has gone away with more training. Example:

https://user-images.githubusercontent.com/5771497/218266970-9029f852-7f1a-430c-930f-d9327353228c.mp4

aTylerRice commented 1 year ago

I'm having trouble understanding what I'm doing wrong... I left data_max_length at 320 * 32, and I don't understand what these numbers mean after looking through the code base. The data I'm using is all 30s clips, but for some reason I'm only getting less than a second of audio in the samples that get spit out at the end of each save results... Is there documentation anywhere on what data_max_length means? I feel it's my problem.

Edit: I think this should maybe be added to the README somewhere. I guess data_max_length corresponds to the number of samples torchaudio gives you, so 30 seconds at 24000 Hz would be data_max_length = 24000 * 30. Does this seem right?
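To make the arithmetic concrete (this is just my understanding of the thread, and the helper name below is made up, not from the repo):

```python
# data_max_length appears to be a count of raw audio samples,
# so it's just sample_rate * clip_duration_in_seconds.
def data_max_length_for(sample_rate_hz: int, seconds: float) -> int:
    """Number of raw audio samples for a clip of the given duration."""
    return int(sample_rate_hz * seconds)

# 30 s of 24 kHz audio -> 720000 samples
print(data_max_length_for(24000, 30))

# the default mentioned above, 320 * 32 = 10240 samples, is only
# ~0.43 s at 24 kHz -- which would explain the <1 s outputs
print(320 * 32 / 24000)
```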

ckwdani commented 1 year ago

@djqualia I trained for about 60k steps with batch size 4, and the robotic sound did not seem to go away or get better after 20k steps. However, the results were pretty good after a day for me too (~15k steps). I started a new run with version 0.12.1 and will keep you posted on the results. :)

smcio commented 1 year ago

Sounds like some decent progress thus far 👍

@djqualia - may I ask what dataset you trained upon to get that ten-second clip you posted above? And, in addition, how many steps had you trained over when that clip was emitted, and what 'data_max_length' value did you use when training upon that dataset?

smcio commented 1 year ago

@aTylerRice Did you ever reach a conclusion on your question above Re: data_max_length? I'm also training on 30s clips, and my save results are also <1s. Thanks.

djqualia commented 1 year ago

re: data_max_length, yes, you will need to set this to your desired sample length based on your sample_rate. i suspect the default value is low for testing, and also because most GPUs can't handle a large value

for my sample above, i'm being a bit ambitious and attempting to train a general music soundstream model at a 44100 sample rate. to achieve this, i've adjusted the stride factor from 320 -> 882 (using 3,6,7,7), based on my limited understanding of the research papers. with my gpu, i can only support a puny batch size of 2. data_max_length is 10 x 44100 for 10s. this is with 12 RVQ layers and codebook size 1024 (as musiclm did, but they used a 24000 sample rate)

i'm now 2 days into training on 0.12.1 (at step 41500), and there's still robot/bit-reduction effect on most samples (some more than others), but it's getting better. TBD on whether this goes away with time (in my past data set, it was only bass-instrument sounds, which perhaps enabled it to converge faster?)

initially, my data set was my very large collection of music+samples (i have DJed in the past and still produce music). i've been augmenting it with every free music database i can find (fma, jamendo, audioset). i've probably spent more time writing code to gather/organize datasets than anything else so far :-)

djqualia commented 1 year ago

p.s. i would recommend reading and trying to understand the associated research papers (soundstream, audiolm, hubert, musiclm). this is all state of the art, so it's not an easy thing to get working (yet)

aTylerRice commented 1 year ago

@djqualia I still had these robotic sounds after training on 15s clips from about 80k samples of the fma dataset. I'm guessing Google either had many more samples to train on, or maybe the model is slightly different. It's a sad state that research is so hard to reproduce. I'm going to try training with a sample size of 10s next, but using all the audio from fma_large, so essentially 300k examples. This will have to wait, though, as my machine learning computer is currently in pieces due to an upgrade.

smcio commented 1 year ago

@djqualia Thanks very much indeed for the info. I've set off a new training run some ten minutes ago with data_max_length set at 240000, as per 24kHz * 10 seconds. With this much higher data_max_length, I had to drop batch_size down to 2 (with grad_accum_every at 8) in order to not get an OOM thrown straight back. I've also adjusted rq_num_quantizers from 8 -> 12 just to see if that helps anything.

My dataset at present is just fma_large, so I'm training on a dataset of over 100k samples atm. I'm also using the 0.12.1 tag.

I've just got two quick questions if you find a moment, please 🙂 RE: "i've adjusted the stride factor from 320 -> 882 (using 3,6,7,7), based on my limited understanding of the research papers" - what parameter are we talking about here? I couldn't find any stride parameter set at 320. Secondly, might I ask what GPU and setup you're using? I've just set off the training run per my descriptions above on an A100 (lambda labs), and by the looks of things I'm sure not going to hit step 41500 in two days of training! 🙂

Thanks for your help and pointers - they're very much appreciated 🙂

djqualia commented 1 year ago

the stride factor comes from a series of convolutions that reduce the dimensionality of the incoming audio signal. the default in this repository, and what was used in the original audiolm paper, was (2, 4, 5, 8). 2 x 4 x 5 x 8 = 320. this results in embeddings at 50hz (one every 20ms) when the sampling rate is 16000

i believe this factor of 320 needs to be scaled to other sampling rates. in the musiclm paper, they used a stride factor of 480 for a sampling rate of 24000, which is proportional. thus, for 44100 i computed a stride factor of 882 was needed (44100 / 16000 * 320). i then determined the factors of 882 to come up with a set of strides that resulted in this...

my GPU is an expensive NVIDIA RTX A6000 w/48 gb ram...

hth!

smcio commented 1 year ago

@djqualia thanks again for the info above - very much appreciated.

Something occurred to me in re-reading your posts above, and I was wondering if I might clarify something with you if possible. Tagging @lucidrains in as well because it's related to the current codebase itself, so I'd be able to put up a PR if the below is correct.

In the SoundStream class' constructor (__init__), the default value for target_sample_hz is 24000, and the default value for strides is (2, 4, 5, 8), i.e.

```python
class SoundStream(nn.Module):
    def __init__(
        self,
        *,
        channels = 32,
        strides = (2, 4, 5, 8),

        ...

        target_sample_hz = 24000

        ...
    ):
```

In your post above, you recall that 320 (i.e. (2, 4, 5, 8)) is the strides value used in the AudioLM paper when the sampling rate is 16kHz, and a strides value of 480 is used in MusicLM when the sampling rate is 24kHz. Does this mean that the default strides value of (2, 4, 5, 8) and the default sampling rate (target_sample_hz = 24000) for the SoundStream class aren't quite compatible with one another (hence strides should be changed if you intend to train with sampling rate as 24000)?
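To make the apparent mismatch concrete (my own arithmetic, assuming the proportional-scaling reasoning above is right):

```python
from math import prod

default_strides = (2, 4, 5, 8)     # repo default
default_sr = 24000                 # repo default target_sample_hz

# frame rate implied by the defaults:
print(default_sr / prod(default_strides))   # 75.0 embeddings/sec, not 50

# the proportionally scaled stride product for 24 kHz, per the 480 figure above:
scaled = 24000 / 16000 * 320
print(scaled, 24000 / scaled)               # 480.0 -> 50.0 embeddings/sec
```

So if the papers target 50 embeddings/sec, the defaults as-is would produce a 75 Hz frame rate at 24 kHz, which is what makes me think the strides should be adjusted.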

Thanks again for your time & work here everyone.

Best, Shaun

Afiyetolsun commented 1 year ago

> sampling rate is 16kHz, and a strides value of 480 is used in MusicLM when the sampling rate is 24kHz.

Yes, you're right: for the 16 kHz used in the SoundStream/AudioLM articles, strides (2, 4, 5, 8) = 320, and for the 24 kHz in the MusicLM article, (3, 4, 5, 8) = 480.

P.S. Screenshot of the audio file from the original SoundStream: (image)

smcio commented 1 year ago

Thanks for confirming @Afiyetolsun 👍

amitaie commented 1 year ago

> Yes, you're right: for the 16 kHz used in the SoundStream/AudioLM articles, strides (2, 4, 5, 8) = 320, and for the 24 kHz in the MusicLM article, (3, 4, 5, 8) = 480. [...] P.S. Screenshot of the audio file from the original SoundStream

What do you mean by "audio file from the original SoundStream"? Where did you get this audio from?

Afiyetolsun commented 1 year ago

> from where you took this audio?

https://google-research.github.io/seanet/audiolm/examples/ - the "Original" samples from AudioLM. You can download the audio files and check the sampling rate. The AudioLM article says 16kHz, and so do the files, so everything makes sense :D

And in MusicLm 24kHz

amitaie commented 1 year ago

OK, I thought you meant original from SoundStream, and in the SoundStream paper they actually used a 24000 sample rate.

Afiyetolsun commented 1 year ago

> SoundStream paper they actually used 24000 sample rate.

Sorry for the confusion! :)

sohananisetty commented 1 year ago

> I've set off a new training run some ten minutes ago with data_max_length set at 240000, as per 24kHz * 10 seconds. With this much higher data_max_length, I had to drop batch_size down to 2 (with grad_accum_every at 8) in order to not get an OOM thrown straight back. [...] My dataset at present is just fma_large, so I'm training on a dataset of over 100k samples atm.

I have also started training on fma_large + MagnaTagATune for a total of 125k data points. I have trained for about 6k steps, but it is still mostly noise and the loss is sort of stuck between 1200 and 1300 (with STFT normalization enabled). Have you gotten better results, or are things the same for you? I have a stride factor of 320 and 8 quantizers at 24 kHz. I am training on 2 A40 48GB GPUs with a batch size of 8, grad_accum 8, TF32 enabled, and an input length of 2 * 24000.

yygle commented 1 year ago

I tried training with the clean part of LibriSpeech (100 + 360) for 12k steps, and it seems there's still robot sound. data_max_length is 16000 * 2, at 16kHz. Any suggestions?

ZhihaoDU commented 1 year ago

@yygle I have the same problem, have you solved it?

turian commented 1 year ago

@djqualia do you mind sharing some of your training curves? I'm wondering what the discriminator losses look like - e.g. do they drop low, then increase and oscillate, suggesting the generator has caught up?

hyhzl commented 1 year ago

@lucidrains i want to train a voice clone for Mandarin Chinese, and i plan to swap soundstream out for encodec. i found a chinese hubert model trained on WenetSpeech (10k+ hours), but the k-means model this project requires (n_clusters and cluster centers) is not provided. what should i do to continue my voice clone exploration? @lucidrains

hyhzl commented 1 year ago


https://huggingface.co/TencentGameMate/chinese-hubert-base/tree/main

WinterStraw commented 8 months ago

> but k-means is not provided, which is required by project, the project require n_clusters and clusters_center, what should i do to continue to my exploration about voice clone?

I encountered a similar problem - could you please share whether you've found a good solution since? Thanks a lot! @hyhzl