Hi there, thanks for your interest in the code, and apologies for the slight delay in responding. Indeed, the other available implementations are all broken, it seems... I was able to get the samplernn-pytorch version working, sort of, but only by installing specific, older versions of most of the required libraries. But even that implementation looks to be unmaintained now, just like the others. This was part of the motivation for 'reviving' SampleRNN.
Training neural networks is, ironically, more of an art than a science, and depends on a lot of trial and error... So I'd say just dive into it with the default settings initially, and then see what results you get. I tend to switch off generation for the first couple of experiments, until I know I'm going to get something reasonable, or on the way to 'good enough'... I might train for about 250 epochs, and save checkpoints every 10 epochs. Of course it depends on the size of your dataset... I'm currently training on a Shostakovich symphony, downsampled to 16k with 8 second chunks, and on my RTX Titan card it's going to take about 16 hours (250 epochs). So be prepared to wait a while! Perhaps begin with a reduced dataset... although be careful of going too small, since a small dataset is typically prone to overfitting (where the network eventually just memorizes the training data).
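To make the checkpoint cadence concrete, here's a minimal, purely illustrative Python sketch; `train_one_epoch` and `save_checkpoint` are hypothetical stand-ins, not the repo's actual API:

```python
# Illustrative sketch only: train for 250 epochs, checkpoint every 10,
# and keep generation switched off until results look promising.
NUM_EPOCHS = 250
CHECKPOINT_EVERY = 10
GENERATE_DURING_TRAINING = False  # enable once training looks healthy

def train_one_epoch(epoch):
    # Hypothetical stand-in for one pass over the training data.
    return 1.0 / epoch  # pretend loss, decreasing over time

def save_checkpoint(epoch):
    # Hypothetical stand-in for persisting model weights.
    print(f"saved checkpoint at epoch {epoch}")

for epoch in range(1, NUM_EPOCHS + 1):
    loss = train_one_epoch(epoch)
    if epoch % CHECKPOINT_EVERY == 0:
        save_checkpoint(epoch)
    if GENERATE_DURING_TRAINING:
        pass  # audio generation would go here
```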
The next big update to the repo is going to add a validation step, which I wasn't able to implement until now... Validation involves reserving a certain percentage (say, 1%) of the input chunks; at the end of each epoch the network is evaluated on this mini-dataset but not trained on it... So it's always being tested against fresh data to which it hasn't been exposed before. This can be an important factor in avoiding over/underfitting and in tweaking hyperparameters. This will all be automated; all you'll need to do is specify the train/validate split through the val_pcnt parameter (which is ignored at the moment). I'll be pushing that update out next week, keep an eye out for it.
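To illustrate what the split will look like, here's a minimal sketch, assuming a list of chunk filenames and a val_pcnt expressed as a percentage; the `split_dataset` helper is hypothetical, only the `val_pcnt` name comes from the repo:

```python
# Minimal sketch of a train/validation split over audio chunks.
# Not the repo's actual implementation.
import random

def split_dataset(chunks, val_pcnt=1.0, seed=42):
    """Reserve val_pcnt percent of chunks for validation."""
    chunks = list(chunks)
    random.Random(seed).shuffle(chunks)
    n_val = max(1, int(len(chunks) * val_pcnt / 100))
    return chunks[n_val:], chunks[:n_val]  # train, validation

train_chunks, val_chunks = split_dataset(
    [f"chunk_{i:04d}.wav" for i in range(1000)], val_pcnt=1.0)
print(len(train_chunks), len(val_chunks))  # 990 10
```

At the end of each epoch you'd compute the loss over `val_chunks` without any gradient updates; a validation loss that climbs while the training loss keeps falling is the classic sign of overfitting.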
BTW, what card have you got?
I have an RTX 2070 Super. My initial data sets consist of 4 albums of Periphery (~4 hours of music - instrumental metal, downloaded from these playlists):
```bash
# YouTube playlists for instrumental Periphery albums - Periphery III, I, II, IV, Omega, Juggernaut
periphery_album_1="PLSTnbYVfZR03JGmoJri6Sgvl4f0VAi9st"
periphery_album_2="PL7DVODcLLjFplM5Rw-bNUyrwAECIPRK26"
periphery_album_3="PLuEYu7jyZXdde7ePWV1RUvrpDKB8Gr6ex"
periphery_album_45="PLEFyfJZV-vtKeBedXTv82yxS7gRZkzfWr"
periphery_album_6="PL6FJ2Ri6gSpOWcbdq--P5J0IRcgH-4RVm"

youtube-dl -ci -f "bestaudio" -x --audio-format wav "${periphery_album_1}"
youtube-dl -ci -f "bestaudio" -x --audio-format wav "${periphery_album_2}"
youtube-dl -ci -f "bestaudio" -x --audio-format wav "${periphery_album_3}"
youtube-dl -ci -f "bestaudio" -x --audio-format wav "${periphery_album_45}"
youtube-dl -ci -f "bestaudio" -x --audio-format wav "${periphery_album_6}"
```
All mono, downsampled to 16k, chopped into 8 second chunks (using your chunk_audio.py script with audio silence removal).
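For anyone following along, here's a rough sketch of that preprocessing (16 kHz mono, 8-second chunks) using librosa and soundfile. This is not the repo's chunk_audio.py, it omits the silence removal step, and the input filename is an assumption:

```python
# A hedged sketch of the preprocessing: resample to 16 kHz mono and
# slice into 8-second chunks. Assumes periphery_track.wav exists;
# silence removal (as done by chunk_audio.py) is omitted here.
import librosa
import soundfile as sf

SR = 16000          # target sample rate
CHUNK_SECONDS = 8   # chunk length

audio, _ = librosa.load("periphery_track.wav", sr=SR, mono=True)
samples_per_chunk = SR * CHUNK_SECONDS
for n, start in enumerate(range(0, len(audio) - samples_per_chunk + 1,
                                samples_per_chunk)):
    sf.write(f"chunk_{n:04d}.wav", audio[start:start + samples_per_chunk], SR)
```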
The training (100 epochs) has been running for 2 days and counting. My main goal here is to see whether I can recreate what other neural-network musicians are doing, e.g. https://dadabots.com/
I believe that in their workflows, the final step involves generating hours upon hours of music from the trained models, and then hand-curating the generated music; i.e. it's very unrealistic to expect a generated set of samples to sound like a cohesive song.
Thanks for replying - if you're interested I can share my results (if I end up creating any sort of funky music).
Hello - thanks for this repo. I unsuccessfully tried running several SampleRNN forks, which were outdated and wouldn't run on a modern CUDA toolkit.
I want to run some experiments and train on some music. Right now, for example, I'm trying to generate an imitation/fake of the band Periphery. I've downloaded every Periphery song, resampled to 16kHz mono, and chunked it with the chunk script.
My next step is to follow what's in your README.
Are the default parameters "good enough" for this use case, to generate music in imitation of an artist? Thanks.