# Music auto-tagger using Keras
`MusicTaggerCNN` and `MusicTaggerCRNN` are based on an old (and a bit incorrect) implementation of Batch Normalization from an old version of Keras (thank goodness it worked anyway), so they are quite tricky to fix.
The repo provides three models: `MusicTaggerCNN`, `MusicTaggerCRNN`, and `compact_cnn`.

## Prerequisites
- Keras, to run `example.py`
- librosa
- The input shape is `(None, channel, height, width)`, i.e. following the Theano convention. If you're using TensorFlow as your backend, you should check `~/.keras/keras.json` to see whether `image_dim_ordering` is set to `th`, i.e. `"image_dim_ordering": "th"` (see the sample config after this list).
- For `compact_cnn`, you need to install Kapre. `MusicTaggerCNN` and `MusicTaggerCRNN` do not require it.
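For reference, a minimal sketch of a `~/.keras/keras.json` set up for the Theano convention; the surrounding fields are typical Keras 1.x defaults, and only `image_dim_ordering` matters here:

```json
{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}
```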
## Structures

*(Figure) Left: the CNN of `compact_cnn` and `music_tagger_cnn`. Right: `music_tagger_crnn`.*
## The 50 tags

```
['rock', 'pop', 'alternative', 'indie', 'electronic', 'female vocalists',
 'dance', '00s', 'alternative rock', 'jazz', 'beautiful', 'metal',
 'chillout', 'male vocalists', 'classic rock', 'soul', 'indie rock',
 'Mellow', 'electronica', '80s', 'folk', '90s', 'chill', 'instrumental',
 'punk', 'oldies', 'blues', 'hard rock', 'ambient', 'acoustic', 'experimental',
 'female vocalist', 'guitar', 'Hip-Hop', '70s', 'party', 'country', 'easy listening',
 'sexy', 'catchy', 'funk', 'electro', 'heavy metal', 'Progressive rock',
 '60s', 'rnb', 'indie pop', 'sad', 'House', 'happy']
```
## Which one to use?
Use `compact_cnn`. Otherwise, read below.

- `MusicTaggerCNN` is faster than `MusicTaggerCRNN` (wall-clock time).
- `MusicTaggerCRNN` has a smaller number of trainable parameters. You can even decrease the number of feature maps and `MusicTaggerCRNN` still works quite well in that case, i.e., the current setting is a little rich (or redundant). With `MusicTaggerCNN`, however, the performance drops if you reduce the number of parameters.

Therefore, if you just want to use the pre-trained weights, use `MusicTaggerCNN`. If you want to train it yourself, it's up to you; in general, I would use `MusicTaggerCRNN` after downsizing it to about 0.2M parameters (then the training time would be similar to `MusicTaggerCNN`'s). To reduce the size, change the number of feature maps of the convolution layers, as sketched below.
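A minimal sketch of that downsizing, assuming Keras 1.x syntax (the version this repo targets); the block layout and feature-map counts below are illustrative, not the actual architecture in `music_tagger_crnn.py`:

```python
from keras.layers import Input, Convolution2D, MaxPooling2D, BatchNormalization
from keras.layers.advanced_activations import ELU

def conv_block(x, n_feature_maps):
    # The only knob being changed is n_feature_maps: fewer maps, fewer parameters.
    x = Convolution2D(n_feature_maps, 3, 3, border_mode='same')(x)  # Keras 1.x signature
    x = BatchNormalization(axis=1)(x)  # axis=1 is the channel axis under 'th' ordering
    x = ELU()(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    return x

# 1-channel mel-spectrogram input, 96 bins x 1366 frames as in this repo
x = Input(shape=(1, 96, 1366))
for n in [32, 64, 64, 64]:  # reduced widths; check model.summary() against the ~0.2M target
    x = conv_block(x, n)
```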
## Feature extraction
By setting `include_top=False`, you can get a 256-dim (`MusicTaggerCNN`) or 32-dim (`MusicTaggerCRNN`) feature representation.
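A minimal sketch of feature extraction, assuming the constructor signature used in this repo's `music_tagger_crnn.py` and a hypothetical `load_melgram` helper; `example_feat_extract.py` is the actual reference:

```python
from music_tagger_crnn import MusicTaggerCRNN

# include_top=False drops the final 50-way output layer, so the model
# returns the internal feature vector instead of tag probabilities.
model = MusicTaggerCRNN(weights='msd', include_top=False)

melgram = load_melgram('data/bensound-cute.mp3')  # hypothetical loader; shape (1, 1, 96, 1366)
feat = model.predict(melgram)  # shape (1, 32) for MusicTaggerCRNN
```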
In general, I would recommend `MusicTaggerCRNN` and the 32-dim features; for predicting 50 tags, 256 features sound a bit too large. I haven't looked into the 256-dim features, only the 32-dim ones. I thought of using PCA to reduce the dimensionality further, but ended up not applying it because `mean(abs(recovered - original) / original)` was .12 (dim: 32->16) and .05 (dim: 32->24), which don't seem good enough.
Probably the 256-dim features are redundant (in which case you could reduce them down effectively with PCA), or they just include more information than the 32-dim ones (e.g., features at different hierarchical levels). If the dimension size doesn't matter to you, the 256-dim ones are worth choosing.
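For what it's worth, a sketch of that PCA check, assuming `features` is an `(n_songs, 32)` array of extracted features; scikit-learn is used here although it is not a dependency of this repo:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_recovery_error(features, n_components):
    # Project down, reconstruct, and compute the relative error
    # mean(abs(recovered - original) / original) quoted above
    # (absolute values throughout to keep the ratio positive).
    pca = PCA(n_components=n_components)
    recovered = pca.inverse_transform(pca.fit_transform(features))
    return np.mean(np.abs(recovered - features) / np.abs(features))

# pca_recovery_error(features, 16) -> ~.12 and
# pca_recovery_error(features, 24) -> ~.05 in the experiment above
```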
## Usage
```
$ python example_tagging.py
$ python example_feat_extract.py
```
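Roughly, `example_tagging.py` does something like the following sketch; the `load_melgram` helper is hypothetical and `TAGS` stands for the 50-tag list above:

```python
import numpy as np
from music_tagger_crnn import MusicTaggerCRNN

model = MusicTaggerCRNN(weights='msd')  # include_top defaults to True: 50 tag outputs

melgram = load_melgram('data/bensound-cute.mp3')  # hypothetical loader; shape (1, 1, 96, 1366)
probs = model.predict(melgram)[0]

top10 = np.argsort(probs)[::-1][:10]  # ten most probable tags, as printed below
print([(TAGS[i], '%.3f' % probs[i]) for i in top10])
```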
### Results (Theano backend, `MusicTaggerCRNN`)
```
data/bensound-cute.mp3
[('jazz', '0.444'), ('instrumental', '0.151'), ('folk', '0.103'), ('Hip-Hop', '0.103'), ('ambient', '0.077')]
[('guitar', '0.068'), ('rock', '0.058'), ('acoustic', '0.054'), ('experimental', '0.051'), ('electronic', '0.042')]

data/bensound-actionable.mp3
[('jazz', '0.416'), ('instrumental', '0.181'), ('Hip-Hop', '0.085'), ('folk', '0.085'), ('rock', '0.081')]
[('ambient', '0.068'), ('guitar', '0.062'), ('Progressive rock', '0.048'), ('experimental', '0.046'), ('acoustic', '0.046')]

data/bensound-dubstep.mp3
[('Hip-Hop', '0.245'), ('rock', '0.183'), ('alternative', '0.081'), ('electronic', '0.076'), ('alternative rock', '0.053')]
[('metal', '0.051'), ('indie', '0.028'), ('instrumental', '0.027'), ('electronica', '0.024'), ('hard rock', '0.023')]

data/bensound-thejazzpiano.mp3
[('jazz', '0.299'), ('instrumental', '0.174'), ('electronic', '0.089'), ('ambient', '0.061'), ('chillout', '0.052')]
[('rock', '0.044'), ('guitar', '0.044'), ('funk', '0.033'), ('chill', '0.032'), ('Progressive rock', '0.029')]
```
## References
- Compact CNN: will be updated.
- Convnet: "Automatic Tagging using Deep Convolutional Neural Networks", Keunwoo Choi, George Fazekas, Mark Sandler. 17th International Society for Music Information Retrieval Conference (ISMIR), New York, USA, 2016.
- ConvRNN: "Convolutional Recurrent Neural Networks for Music Classification", Keunwoo Choi, George Fazekas, Mark Sandler, Kyunghyun Cho. arXiv:1609.04243, 2016.
## Credits
Test music items are from http://www.bensound.com.