soroushmehr / sampleRNN_ICLR2017

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
https://arxiv.org/abs/1612.07837
MIT License

Pretty amazing results from a training run on classical guitar music (single instrument), at epoch 5 & 7 #13

Closed LinkOne1A closed 7 years ago

LinkOne1A commented 7 years ago

https://soundcloud.com/user-637335781/sets/training-1-on-classical-guitar-music-single-instrument-at-epoch-5-7

Subjectively speaking, the sound quality is better than the results I got from training on the piano set.

WAV files generated at epoch 5 (~15k training iterations) and epoch 7 (~20k training iterations)

Single GPU: 8 GB GTX 1080 with 2560 CUDA cores. End of epoch 7 reached at about 8 hours.

Validation! Done!

>>> Best validation cost of 1.78753066063 reached. Testing! Done!
>>> test cost:1.8329474926  total time:60.4850599766
epoch:7 total iters:20498   wall clock time:7.04h
>>> Lowest valid cost:1.78753066063  Corresponding test cost:1.8329474926
    train cost:1.7714   total time:6.00h    per iter:1.054s
    valid cost:1.7875   total time:0.02h
    test  cost:1.8329   total time:0.02h
Saving params! Done!

Run command: THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier.py --exp BEST_2TIER --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 3 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 64 --weight_norm True --learn_h0 True --which_set MUSIC

Training was done on several guitar passages from YouTube, total audio duration ~4 hours.


soroushmehr commented 7 years ago

Thanks for sharing. Good to know! :) People have tried it on different datasets, including Korean, classical music, and ambient. @richardassar even got interesting results from training on a couple of hours of Tangerine Dream works. See: https://soundcloud.com/psylent-v/tracks

LinkOne1A commented 7 years ago

Do you know if the generated sound (of t-dream) is purely from the network, or was it mixed with some supporting passages (such as drums) added by a human?

richardassar commented 7 years ago

It's purely the network.


richardassar commented 7 years ago

@LinkOne1A These guitar samples are really nice.

LinkOne1A commented 7 years ago

Credit to the network and the folks behind the SampleRNN paper!

I'm surprised that training on multi-instrument data worked so well, and I'm puzzled why that is. My intuition was that multiple instruments playing at the same time limit the space of ~valid~ ("pleasing" is maybe a better word) combinations in the output; I'm not sure how I would go about proving (or disproving) this.

What has been your experience in this area?

richardassar commented 7 years ago

For Tangerine Dream the validation loss (in bits) was near to 3.2 after 300k iterations vs below 1.0 for solo piano. Note, I used the three_tier.py model in both cases.

It seems weight normalisation, both in the linear layers and the transformations inside the GRUs, helps with generalisation. I'm conducting some experiments to verify this for myself.
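For reference, here's a minimal NumPy sketch of the weight-normalisation reparameterisation (w = g * v / ||v||, per Salimans & Kingma) applied to a single linear layer, in the spirit of the --weight_norm flag; the shapes and initial gain values are illustrative assumptions, not the repo's Theano implementation:

```python
import numpy as np

def weight_norm_linear(x, v, g, b):
    """Weight-normalised linear layer: W = g * v / ||v|| (per output unit).

    x: (batch, in_dim), v: (in_dim, out_dim), g and b: (out_dim,)
    """
    # Normalise each column of v to unit L2 norm, then rescale by the learned gain g.
    w = v / np.linalg.norm(v, axis=0, keepdims=True) * g
    return x @ w + b

# Toy usage with the two_tier config's hidden size (--dim 1024).
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 1024)).astype(np.float32)
v = rng.standard_normal((1024, 1024)).astype(np.float32)
g = np.ones(1024, dtype=np.float32)
b = np.zeros(1024, dtype=np.float32)
h = weight_norm_linear(x, v, g, b)  # (64, 1024)
```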

Some of SampleRNN's capacity to generalise might be due to the quantization noise introduced in going to 8 bits. It may be interesting to try something like https://arxiv.org/pdf/1701.06548.pdf to further improve generalisation; however, I've yet to observe an increase in validation loss, so we're probably slightly underparameterised on these datasets.
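For context, the linked paper (Pereyra et al., arXiv:1701.06548) regularises by penalising over-confident output distributions. A rough NumPy sketch of that idea applied to the 256-way softmax over quantisation levels, with a hypothetical beta, might look like this (not part of the repo's code):

```python
import numpy as np

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Cross-entropy over q_levels classes minus beta * entropy of the predicted
    distribution (hypothetical beta), as in arXiv:1701.06548.

    logits: (batch, q_levels), targets: (batch,) integer class ids.
    """
    # Softmax with max-subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    nll = -np.log(p[np.arange(len(targets)), targets] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    # Penalising low entropy (over-confident outputs) acts as a regulariser.
    return nll - beta * entropy
```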

Ignoring independent_preds, the three_tier model has around 18 million parameters, which is evidently sufficient to capture the temporal dynamics of the signal to an acceptable degree. If you think about it from an information-theoretic point of view (Kolmogorov complexity / minimum description length), there's a lot of redundancy in the signal that can be "compressed" away by the network.

The model seems to capture various synthesiser sounds, crowd noise from the live recordings, and both synthetic and real drums with correct percussive patterns; however, it could not maintain a consistent tempo. This could be helped with some conditioning on an auxiliary time series.

If you used preprocess.py in datasets/music/ then you may want to run your experiments again. See: https://github.com/soroushmehr/sampleRNN_ICLR2017/pull/14

LinkOne1A commented 7 years ago

Thanks for the details! Interesting and surprising that a validation loss of 3.2 produces the t/dream segment.

What was the length of the original (total) audio?

How long did it take to get to 300k steps, and what kind of GPU do you have?

The quantization noise (going to 8 bits): are you referring to the mu-law encode/decode? I ran a stand-alone test of a mu-law-processed WAV file vs. the original WAV and could not hear the difference; an inverted summation of the two sources in Audacity showed very little amplitude, mostly, I think, at the high end of the spectrum.
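For anyone wanting to reproduce that comparison, here's a rough NumPy/SciPy sketch of such a stand-alone test (8-bit mu-law encode, decode, subtract from the original); the file name and the 16-bit mono input are assumptions:

```python
import numpy as np
from scipy.io import wavfile

MU = 255  # standard 8-bit mu-law companding constant

def mulaw_encode(x, mu=MU):
    # x in [-1, 1] -> 256 discrete levels
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * mu).astype(np.uint8)

def mulaw_decode(q, mu=MU):
    # 256 discrete levels -> approximate waveform in [-1, 1]
    y = 2 * (q.astype(np.float32) / mu) - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

sr, audio = wavfile.read('original.wav')          # hypothetical 16-bit mono file
x = audio.astype(np.float32) / 32768.0
residual = x - mulaw_decode(mulaw_encode(x))      # the quantization noise itself
print('residual RMS:', np.sqrt(np.mean(residual ** 2)))
```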

I haven't thought about the descriptive complexity of the source signal. Now that you mention it, I'd say that if a network had to deal with already-compressed data and had to figure out a predictive solution for the next likely outcome in a series... I don't know, but my hunch is that it would be more difficult for the network, which means we would need more complexity (layers) in the network. This is my off-the-cuff thought!

I'll check it out ( #14 ).

I have not yet looked into the 3 tier training.

richardassar commented 7 years ago

Total audio was about 32 hours, although due to the bug I didn't end up training on all of it!

It seems that the required loss for acceptable samples is really relative to the dataset: multiple instruments increase the entropy of the signal, and unlike with piano, the model seems to get "lost" far less frequently because it has a more varied space in which to recover. Before fully converging, the piano samples sometimes go unstable; this effect was almost non-existent when training on Tangerine Dream.

No, I'm referring to quantization noise as x - Q(x) for any quantization scheme. Introducing noise of any kind acts as a regulariser, e.g. https://en.wikipedia.org/wiki/Tikhonov_regularization
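Concretely, for the linear 8-bit scheme used in the run command above (--q_type linear --q_levels 256), the x - Q(x) noise can be sketched like this (the [-1, 1] normalisation and toy signal are assumptions):

```python
import numpy as np

def linear_quantize(x, q_levels=256):
    """Map x in [-1, 1] to q_levels integer bins (as with --q_type linear)."""
    return np.clip(((x + 1) / 2 * (q_levels - 1)).round(), 0, q_levels - 1).astype(np.int64)

def linear_dequantize(q, q_levels=256):
    """Map integer bins back to the bin centres in [-1, 1]."""
    return q.astype(np.float32) / (q_levels - 1) * 2 - 1

x = np.sin(np.linspace(0, 200 * np.pi, 16000)).astype(np.float32)  # toy 1-second signal
noise = x - linear_dequantize(linear_quantize(x))                  # x - Q(x)
print('quantization noise RMS:', np.sqrt(np.mean(noise ** 2)))
```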

It's true that compressed data has more entropy, but if you decompress the signal again (assuming lossy compression), the resulting entropy is lower than that of the original signal and should be easier to model. I was referring to the compressibility of the signal; it seems there's plenty of scope for that.

Something that would be interesting to try, akin to the speaker conditioning mentioned in the Char2Wav paper, is conditioning on instrument or genre with an embedded one-hot signal. This might allow interpolation between styles, etc. This is an area of research I'll be looking into over the next while.
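A hedged sketch of what that conditioning might look like: embed the instrument/genre id and concatenate it onto the frame-level features before the RNN. The sizes and the concatenation point here are hypothetical, not part of any released SampleRNN code:

```python
import numpy as np

rng = np.random.default_rng(0)

N_GENRES, EMB_DIM, FRAME_DIM = 8, 16, 256   # hypothetical sizes

# Learned embedding table for the genre/instrument id (one row per class).
genre_embedding = rng.standard_normal((N_GENRES, EMB_DIM)).astype(np.float32)

def condition_frames(frame_features, genre_id):
    """Concatenate the genre embedding onto every frame-level feature vector."""
    emb = genre_embedding[genre_id]                       # (EMB_DIM,)
    emb = np.broadcast_to(emb, (frame_features.shape[0], EMB_DIM))
    return np.concatenate([frame_features, emb], axis=1)  # (n_frames, FRAME_DIM + EMB_DIM)

frames = rng.standard_normal((64, FRAME_DIM)).astype(np.float32)
conditioned = condition_frames(frames, genre_id=3)        # feed this to the frame-level RNN
```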

It took a couple of days to get to 300k steps; I'm training on a GTX 1080. The machine I have has two, but the script does not split minibatches over both GPUs. I have implemented SampleRNN myself and it can train 4x faster (without GRU weight norm), soon to be released.

richardassar commented 7 years ago

Although I have avoided it so far, it's probably worth filtering out low-amplitude audio segments from the training set. These get amplified during normalization, which pulls up the noise floor and introduces lots of high-energy noise that can only disrupt or slow down training.
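A minimal sketch of that kind of pruning, computing per-segment RMS power and dropping anything below a threshold before normalization; the threshold value and segment layout are assumptions:

```python
import numpy as np

def rms(segment):
    """RMS power of a single 1-D audio segment."""
    return np.sqrt(np.mean(np.square(segment.astype(np.float64))))

def prune_quiet_segments(segments, min_rms=1e-3):
    """Keep only segments whose RMS exceeds a (hypothetical) threshold, so that
    normalization does not amplify near-silence into high-energy noise."""
    powers = np.array([rms(s) for s in segments])
    kept = [s for s, p in zip(segments, powers) if p > min_rms]
    return kept, powers

# segments: list of 1-D float arrays, e.g. fixed-length chunks from the preprocessing step
```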

A plot of the RMS power over each segment in the piano database shows the distribution, and you can see the tail of low-energy signals on the right, which could probably be pruned (especially the one segment with zero energy).

[Image: rms_powers (distribution of RMS power per segment in the piano dataset)]