rncm-prism / prism-samplernn

Neural sound synthesis with TensorFlow 2
MIT License

Seed parameter while generating #6

Open bstivic opened 4 years ago

bstivic commented 4 years ago

Hi,

I have a problem generating audio from a seed audio file. As I understood it, when we provide a seed file, generation continues after the first 64 samples of the seed, and this should push generation in a different direction than the default? I am trying to seed because I get almost identical or very similar generated audio every time (epoch 150, 30-minute dataset, 22050Hz), even with different training parameters.

Error that I get when trying to seed:

```
!python generate.py \
  --output_path ./generated/default/test_1_t075_s10_16000.wav \
  --checkpoint_path ./logdir/default/26.09.2020_12.35.35/model.ckpt-140 \
  --seed ./chunks/chunk_22050_mono_norm_chunk_109.wav \
  --dur 10 \
  --sample_rate 16000 \
  --temperature 0.75 \
  --num_seqs 100 \
  --config_file ./default.config.json
```

```
Traceback (most recent call last):
  File "generate.py", line 225, in <module>
    main()
  File "generate.py", line 221, in main
    args.sample_rate, args.temperature, args.seed, args.seed_offset)
  File "generate.py", line 188, in generate
    init_samples[:, :model.big_frame_size, :] = quantize(seed_audio, q_type, q_levels)
TypeError: 'tensorflow.python.framework.ops.EagerTensor' object does not support item assignment
```

One more question: could the sample rate differences be part of the problem? Does the sample rate have to be the same for training, generation and the seed audio?

Best regards, Branimir

relativeflux commented 4 years ago

Ah yes, thanks for spotting this - the error is caused by trying to assign to a tensor, which is not possible since tensors are immutable in TensorFlow... That buffer was originally a NumPy array, and those are mutable. Apologies, I'll get this fixed.
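For reference, one way the failing assignment could be rewritten (a sketch only, not the actual patch - `quantize_stub` and the buffer shapes here are illustrative stand-ins for the repo's `quantize()` and `init_samples`) is to keep the buffer as a mutable NumPy array until after the seed has been written into it:

```python
import numpy as np

def quantize_stub(audio, q_levels=256):
    # Stand-in for the repo's quantize(): map [-1, 1] floats to integer levels.
    return ((audio + 1.0) / 2.0 * (q_levels - 1)).astype(np.int32)

num_seqs, big_frame_size, total_len = 4, 64, 1024
seed_audio = np.random.uniform(-1, 1, (num_seqs, big_frame_size, 1)).astype(np.float32)

# Build the init buffer as a (mutable) NumPy array, write the seed samples in,
# and only hand it over to TensorFlow (e.g. via tf.convert_to_tensor) afterwards.
init_samples = np.full((num_seqs, total_len, 1), 128, dtype=np.int32)
init_samples[:, :big_frame_size, :] = quantize_stub(seed_audio)
```

Item assignment works on the NumPy side; once converted to an `EagerTensor` the buffer is immutable, which is exactly what the traceback complains about.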

Actually, I'm not sure what the result of having different sample rates during training and generation would be. With regard to the effect of seeding, I suspect it needs a larger chunk of samples to have much effect... I need to look at that feature again; sorry, I haven't had any time to work on it.

bstivic commented 4 years ago

Thank you for the super-fast reply! OK, I finally have some improvements after expanding the dataset to ~4 hours, and the results are sounding promising after just 35 epochs :). Do you have any idea how training could be done on just one ~4-minute song, i.e. generating results in the style of a single song?

It's maybe possible that different sample rates can speed up or slow down the generated audio (something similar happened in the training process with MelGAN), so I switched every parameter to 16kHz.
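The speed-up/slow-down effect is just the ratio of the two rates: samples generated against one sample rate but written (or played back) at another are rescaled in both duration and pitch. A quick arithmetic sketch with the rates from this thread:

```python
# Rates taken from the command above; the effect itself is generic.
train_sr = 22050   # rate of the training/seed audio
gen_sr = 16000     # rate passed to generation / written to the wav header

# One second's worth of 22050 Hz samples, played back at 16000 Hz:
ratio = train_sr / gen_sr                   # 22050 / 16000 = 1.378125
playback_duration = train_sr / gen_sr       # ~1.38 s instead of 1 s

# So the audio comes out ~38% slower, and correspondingly lower in pitch.
```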

relativeflux commented 4 years ago

Thanks - yes, a 4hr dataset should yield something useful... 4 minutes is unfortunately unlikely to be practical; you're welcome to try it, but my guess is that the model will simply overfit, meaning it will basically just memorize your dataset.

Incidentally, in terms of getting a better training workflow, we will shortly be releasing a model tuner/optimizer with the code... It's based on Keras Tuner, and there is already a branch available, although it is experimental and buggy at the moment, with no documentation on the tuner yet. I hope to merge it within the next week or so; I'm testing the implementation on some large datasets now. Instead of blindly picking some hyperparameters and hoping for the best, the tuner will allow users to find the optimal hyperparameters for a dataset, then proceed to a full training session with those hyperparams. It's still likely to be more of an art than a science (which is not a bad thing!), but better than stabbing in the dark!

DigestContent0 commented 3 years ago

> Do you have idea how training can be done on just one song ~4min, like generating results just in style of one song?

An actual 4-minute-long file would not work and would only produce meaningless noise. Perhaps repeating the audio until it reaches more than 25 minutes would work?
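The repetition idea could be sketched like this (a minimal NumPy-only sketch; the function name and the 25-minute target are just taken from the suggestion above). Note that tiling adds no new information - the model still only ever sees 4 minutes of distinct audio:

```python
import numpy as np

def repeat_to_duration(samples, sample_rate, target_minutes):
    """Tile a 1-D audio signal until it is at least target_minutes long."""
    target_len = int(target_minutes * 60 * sample_rate)
    reps = -(-target_len // len(samples))  # ceiling division
    return np.tile(samples, reps)[:target_len]

# e.g. a 4-minute track at 16 kHz repeated out to 25 minutes
track = np.zeros(4 * 60 * 16000, dtype=np.float32)
long_track = repeat_to_duration(track, 16000, 25)
```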

relativeflux commented 3 years ago

> An actual 4 minute-long file would not work and only produce meaningless noise. Perhaps repeating the audio until it reaches more than 25 minutes would work?

@DigestContent0 @bstivic Indeed that would work, though perhaps a better solution would be some kind of data augmentation (although I still suspect you'd need more raw data than a single 4-min track). Augmentation is often used when working with images. I have recently been investigating this very issue, with a view to including a data augmentation script in a future release. I've been experimenting with audiomentations, which looks promising.
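To make the idea concrete, here is a very simple NumPy-only stand-in for what an augmentation pass does - random gain plus a little noise, producing several distinct training variants from one chunk. This is not the audiomentations API (which offers much richer transforms such as pitch shift and time stretch); the function and its parameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(samples, noise_amplitude=0.01, gain_db_range=(-6.0, 6.0)):
    """Toy augmentation: apply a random gain, then add Gaussian noise.
    A crude stand-in for audiomentations-style transforms."""
    gain_db = rng.uniform(*gain_db_range)
    out = samples * (10.0 ** (gain_db / 20.0))       # random gain
    out = out + rng.normal(0.0, noise_amplitude, size=samples.shape)
    return np.clip(out, -1.0, 1.0).astype(np.float32)

# One second of audio at 16 kHz -> four slightly different training variants
chunk = rng.uniform(-0.5, 0.5, 16000).astype(np.float32)
variants = [augment(chunk) for _ in range(4)]
```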

Dadabots claim that they got good results from 3200 chunks (overlapped). Having worked with datasets of a few hundred chunks I can confirm that, whilst you might be able to achieve good training accuracy, validation accuracy indicates classic overfitting after a few epochs (that's on the validate branch, which I am hoping to merge to master very shortly).
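Overlapped chunking, as mentioned above, is a cheap way to multiply the number of training examples from the same raw audio. A sketch of the idea (the 8-second chunk length and 50% overlap are arbitrary choices for illustration, not this repo's defaults):

```python
import numpy as np

def overlapped_chunks(samples, chunk_len, hop_len):
    """Split a 1-D signal into fixed-length chunks spaced hop_len apart
    (overlap = chunk_len - hop_len samples between consecutive chunks)."""
    chunks = []
    for start in range(0, len(samples) - chunk_len + 1, hop_len):
        chunks.append(samples[start:start + chunk_len])
    return np.stack(chunks)

# 60 s of audio at 16 kHz, 8 s chunks with 50% overlap:
audio = np.zeros(60 * 16000, dtype=np.float32)
chunks = overlapped_chunks(audio, chunk_len=8 * 16000, hop_len=4 * 16000)
# 50% overlap roughly doubles the chunk count versus non-overlapping chunks.
```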

I've added a gist for using audiomentations on a directory of wav files.