What is the rule of thumb for generating music with wavenet?

pennygalaxi commented 7 years ago

I have a large library of mp3 songs. What is the best way to process these songs in order to get a good result with wavenet?

So far, I tried the following approach which doesn't seem to work very well: convert mp3s to wav files (16 bits per sample) and then run the training script (with default parameters). My questions are as follows:

The mp3 files range from 3 MB to 10 MB (songs of 3 to 8 minutes). Should I chunk them into smaller files of, e.g., 30 seconds?
Do you think the songs should be instrumental only or is it okay to have voices? (the songs I am using right now have voices and the output sounds very noisy)
Do you have any ideas on how many songs/samples are considered as bare minimum for the training data? (just a ballpark estimate)

Please feel free to add any comments that you think are relevant to getting a good result. Thanks a lot!
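For reference, the chunking idea in the first question could be sketched as follows. This is a minimal sketch, assuming the mp3 files have already been converted to uncompressed WAV; the function name `split_wav` and the output naming scheme are hypothetical and are not part of tensorflow-wavenet.

```python
import wave
from pathlib import Path


def split_wav(src_path, out_dir, chunk_seconds=30):
    """Split one WAV file into consecutive chunks of at most chunk_seconds.

    Hypothetical helper for illustration; not part of tensorflow-wavenet.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            out_path = out_dir / f"{Path(src_path).stem}_{index:04d}.wav"
            with wave.open(str(out_path), "wb") as dst:
                # Copy channels, sample width and rate; the frame count in
                # the header is corrected when the chunk file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```

The last chunk of each song will usually be shorter than the others. Whether chunking actually improves training is exactly the open question here, so treat this as a way to run the experiment and not as a recommendation.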

veqtor commented 7 years ago

Depending on what you're trying to train on, it might be that the net is trying to find a "universal" solution that just doesn't exist without local and/or global conditioning being introduced. A longer receptive field might solve some issues; the current settings yield a receptive field of ~250 ms at 16 kHz.
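As a rough check on that figure, the receptive field of a stack of dilated causal convolutions can be estimated from the dilation list. The sketch below assumes a filter width of 2 and a hypothetical configuration of 4 stacks with dilations 1 to 512; the actual settings in use at the time are not stated in this thread, and any initial causal layer is ignored.

```python
def receptive_field_samples(dilations, filter_width=2):
    """Receptive field, in samples, of stacked dilated causal convolutions.

    Each layer with dilation d adds (filter_width - 1) * d samples of context.
    """
    return (filter_width - 1) * sum(dilations) + 1


# Hypothetical configuration: 4 stacks of dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)] * 4
samples = receptive_field_samples(dilations)
milliseconds = 1000 * samples / 16000  # at a 16 kHz sample rate
print(samples, round(milliseconds))  # prints: 4093 256
```

Under these assumptions the estimate lands close to the ~250 ms mentioned above, which is a small fraction of a second of musical context.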

At what step have you given up so far?

Try music that has the same instrumentation, genre and tempo; it might help. I think it wasn't a coincidence that Google chose to train on classical piano music. The network would then, in terms of sound generation, only have to build a representation of a piano.

I'd also suggest actually listening to whatever the network has found so far; it might be getting close to some kind of solution. If you have very diverse training material, maybe it needs to train for a REALLY long time. I tried training on 8-bit music and reached some kind of simple solution quite quickly; speech seems to need around 88k steps. We don't know what is needed to generate music (especially with complex timbres, variations on instruments such as electric guitar with effects, mastering affecting the inter-dynamics of instruments, and so on). The WaveNet paper briefly mentions using global conditioning that describes genre, tempo and some other inputs when training on arbitrary music sets.

neale commented 7 years ago

@veqtor is right with respect to training time. I used a 1070 and tried for quite some time to get good audio output using mp3 files. Here are some of the hints I found; these probably break down if your resources are large.

Don't use classical/multi-instrument data. Your model will use a relatively small receptive field, and processing classical music at anything less than 5-10s per window sounds like chaos. Use a single instrument; I chose solo piano and it instantly yielded better results.
Use homogeneous music. I got better results when I used similar music. Mixing high-tempo or upbeat samples with somber, slower pieces tended to give worse results in the same amount of training time than similar-sounding data.
Max out your receptive field size in wavenet_params.json. You can do this by stacking [1..256] layers, or you can extend them to [1..4096]. This may not help since 4096 steps between conv ops is very sparse, but it technically makes the receptive field larger so try it out.
If your model is diverging on generation, maybe change the output ReLUs to scaled tanh units to bound the output.
I used 10 GB of YouTube piano and trained for 200K+ steps, and that still wasn't enough to reach DeepMind-level results.
Try Google Cloud with the free credits you get when you sign up. If you want non-trivial output, you need more data and more compute power than is usually available to a single person.
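To make the receptive-field hint concrete, here is a minimal sketch that builds the two dilation patterns mentioned above and compares their reach. It assumes a filter width of 2, a 16 kHz sample rate and 4 stacks, and it assumes that wavenet_params.json takes the dilations as a flat list under a `dilations` key; the helper names are hypothetical, so check them against your copy of the file.

```python
import json


def make_dilations(max_dilation, num_stacks):
    """Repeat the powers of two 1, 2, 4, ..., max_dilation num_stacks times."""
    stack = []
    d = 1
    while d <= max_dilation:
        stack.append(d)
        d *= 2
    return stack * num_stacks


def receptive_field_seconds(dilations, filter_width=2, sample_rate=16000):
    """Approximate receptive field in seconds (ignores any initial layer)."""
    return ((filter_width - 1) * sum(dilations) + 1) / sample_rate


for max_dilation in (256, 4096):
    dilations = make_dilations(max_dilation, num_stacks=4)
    seconds = receptive_field_seconds(dilations)
    print(max_dilation, len(dilations), round(seconds, 3))

# Assumed layout of the relevant part of the params file.
print(json.dumps({"filter_width": 2, "dilations": make_dilations(4096, 4)}))
```

With these assumptions the 1..256 pattern covers about 0.13 s and the 1..4096 pattern about 2.05 s, still short of the 5-10 s window mentioned in the first hint.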

I hope some of this helps @pennygalaxi

devinplatt commented 7 years ago

Hey @Neale, thanks for the tips! :) I'm wondering if you could elaborate on a few of your points:

When you say that your model trained for 200k steps on 10GB of youtube audio didn't reach Deepmind-level results, was this due more to musical traits or the sound quality itself? (It'd be great to hear if training longer reduces blemishes like crackling/hissing/static like those I get at: https://soundcloud.com/user-407216443/birds_12k_steps_default_wavenet_params.) Do you have any generated audio from this model that you could share?
Have you tried replacing ReLUs with scaled tanh units? And by "diverging" do you mean that large values from the ReLUs can at times cause peaks in the output audio, which might in turn cause "crackling"?
Just out of curiosity, what did you use as the source for your 10GB of youtube piano audio?
Finally, how do you feel about using the 1070 GTX for this task? (Thinking about buying one myself.)
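For context on the ReLU question, the difference in boundedness can be illustrated in a few lines. This is only a sketch of one possible reading of "scaled tanh" (here `scale * tanh(x / scale)`, with a hypothetical `scale` parameter); it is not taken from the repository, and it does not show whether the change removes crackling.

```python
import math


def relu(x):
    """Unbounded above: large inputs pass through unchanged."""
    return max(0.0, x)


def scaled_tanh(x, scale=1.0):
    """Bounded activation: the output stays within [-scale, scale]."""
    return scale * math.tanh(x / scale)


for x in (0.5, 5.0, 50.0):
    print(x, relu(x), round(scaled_tanh(x, scale=2.0), 3))
```

For small inputs the two behave similarly, while for large inputs the ReLU output keeps growing and the scaled tanh saturates near `scale`.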

Thanks!

njbittner commented 7 years ago

@neale or @veqtor, any chance either of you would be willing to share one of your trained music models?

ibab / tensorflow-wavenet, issue #195