Closed cyanbx closed 1 year ago
get in touch with @eonglints, as he has successfully trained the semantic transformer. Not only that, he is a young researcher in the field who, in my mind, will eventually become an expert
you should update to the latest version and use data_max_length_seconds. As for num_train_steps, the answer is: train for as long as possible, with as much data as possible, without overfitting
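For reference, a minimal sketch of how a length in seconds relates to the older data_max_length argument, which was given in raw samples (assuming a 24 kHz sample rate, the SoundStream default in this repo — adjust if your codec runs at a different rate):

```python
# Assumption: the codec operates at 24 kHz; change this to match your setup.
SAMPLE_RATE = 24_000

def seconds_to_samples(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a crop length in seconds to the sample count that the
    older data_max_length argument expected."""
    return int(seconds * sample_rate)

# 2-second training crops correspond to 48000 raw samples at 24 kHz
print(seconds_to_samples(2))
```

Using data_max_length_seconds lets you think in clip duration directly instead of doing this conversion by hand.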
Yeah, definitely more training steps will be better, but I wonder what the lower bound on training steps is to get intelligible speech out of the acoustic transformers.
@cyanbx the lower bound would be at least one epoch over your entire dataset, and you need a dataset in the millions. For transformers, parameter count needs to be sufficient; I estimate this should be at least in the 700 million - 2 billion range. But you should see some signal even at 200 million parameters and maybe a dataset of 100k
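For a sense of scale on those parameter counts, a common rule of thumb for decoder-only transformers (roughly 12 · depth · dim² weights per model, ignoring embeddings and norms) can be sketched as:

```python
def approx_transformer_params(dim: int, depth: int) -> int:
    """Rough parameter count for a decoder-only transformer.
    Each block holds ~4*dim^2 attention weights (q, k, v, out
    projections) plus ~8*dim^2 feed-forward weights (4x expansion),
    so ~12*dim^2 per layer. Embeddings and layer norms are ignored."""
    return 12 * depth * dim * dim

# e.g. dim=1024, depth=24 lands around 300M parameters,
# in the "some signal" regime mentioned above
print(approx_transformer_params(1024, 24))
```

This is only a back-of-the-envelope estimate; the exact count depends on the feed-forward expansion factor and any extra projection layers.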
Hi, thanks for the great work.
I have gotten reconstructed human voice from the trained SoundStream, but failed to get speech from the combined AudioLM (only electronic noise). I guess my semantic and acoustic transformers may not have been trained properly. I wonder what data_max_length and num_train_steps we should use to train the three transformers to get understandable speech?