lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

typical data length and train steps of the transformers? #101

Closed: cyanbx closed this issue 1 year ago

cyanbx commented 1 year ago

Hi, thanks for the great work.

I was able to get reconstructed human voice from the trained SoundStream, but failed to get speech from the combined AudioLM (only electronic noise). I suspect my semantic and acoustic transformers may not have been trained properly. What data_max_length and num_train_steps should we use to train the three transformers to get understandable speech?

lucidrains commented 1 year ago

get in touch with @eonglints, as he has successfully trained the semantic transformer. not only that, but he is a young researcher in the field who, in my mind, will eventually become an expert

lucidrains commented 1 year ago

you should update to the latest version and use data_max_length_seconds

as for num_train_steps, the answer is for as long as possible, with as much data as possible, without overfitting
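
for reference, a minimal sketch of the updated usage, following the pattern shown in the repo README; the checkpoint paths and hyperparameters below are placeholders, not recommendations, and argument names may differ slightly between versions:

```python
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

# hubert checkpoints used to extract semantic tokens (paths are placeholders)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6
)

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    folder = '/path/to/audio/files',
    batch_size = 4,
    data_max_length_seconds = 10,   # crop / pad training audio to this many seconds
    num_train_steps = 1_000_000     # train as long as the dataset allows without overfitting
)

trainer.train()
```

the coarse and fine transformer trainers follow the same pattern in the README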

cyanbx commented 1 year ago

> you should update to the latest version and use data_max_length_seconds
>
> as for num_train_steps, the answer is for as long as possible, with as much data as possible, without overfitting

Yeah, more training steps will definitely be better, but I am wondering what the lower bound on training steps is to get intelligible speech from the acoustic transformers.

lucidrains commented 1 year ago

@cyanbx the lower bound would be at least one epoch over your entire dataset, and you need a dataset in the millions of samples. for the transformers, parameter count also needs to be sufficient; I estimate it should be in the 700 million to 2 billion range. but you should see some signal even at around 200 million parameters and a dataset of maybe 100k samples
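
as a rough sketch for gauging where a config lands relative to those numbers, you can count parameters directly; the dim / depth values below are illustrative only, not a recommended configuration:

```python
from audiolm_pytorch import SemanticTransformer

# illustrative config only; scale dim / depth toward your parameter budget
semantic_transformer = SemanticTransformer(
    num_semantic_tokens = 500,   # e.g. 500 kmeans clusters from hubert
    dim = 1024,
    depth = 24
)

num_params = sum(p.numel() for p in semantic_transformer.parameters())
print(f'{num_params / 1e6:.1f}M parameters')
```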