State-of-the-art Music Tagging Models

PyTorch implementation of state-of-the-art music tagging models

Demo and Docker image on Replicate

Reference

Evaluation of CNN-based Automatic Music Tagging Models, SMC 2020 [arxiv]

-- Minz Won, Andres Ferraro, Dmitry Bogdanov, and Xavier Serra

TL;DR

If your dataset is relatively small: take advantage of domain knowledge using Musicnn.
If you want a simple yet best-performing model: Short-chunk CNN with residual connections (a so-called vgg-ish model with a small receptive field)
If you want the best performance with generalization ability: Harmonic CNN

Available Models

FCN : Automatic Tagging using Deep Convolutional Neural Networks, Choi et al., 2016 [arxiv]
Musicnn : End-to-end Learning for Music Audio Tagging at Scale, Pons et al., 2018 [arxiv]
Sample-level CNN : Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms, Lee et al., 2017 [arxiv]
Sample-level CNN + Squeeze-and-excitation : Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms, Kim et al., 2018 [arxiv]
CRNN : Convolutional Recurrent Neural Networks for Music Classification, Choi et al., 2016 [arxiv]
Self-attention : Toward Interpretable Music Tagging with Self-Attention, Won et al., 2019 [arxiv]
Harmonic CNN : Data-Driven Harmonic Filters for Audio Representation Learning, Won et al., 2020 [pdf]
Short-chunk CNN : Prevalent 3x3 CNN. So-called vgg-ish model with a small receptive field.
Short-chunk CNN + Residual : Short-chunk CNN with residual connections.

Requirements

conda create -n YOUR_ENV_NAME python=3.7
conda activate YOUR_ENV_NAME
pip install -r requirements.txt

Preprocessing

The STFT is computed on the fly, so preprocessing only needs to read the audio files, resample them, and save them as .npy files.

cd preprocessing/

python -u mtat_read.py run YOUR_DATA_PATH
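
The read-and-resample step can be sketched as follows. This is a minimal, NumPy-only illustration and not the code in mtat_read.py: the 16 kHz target rate, the linear-interpolation resampler, and the helper names `resample_linear` and `save_as_npy` are all assumptions made for the example. A real pipeline would typically decode audio files with a dedicated audio library and use a band-limited resampler.

```python
import os
import tempfile

import numpy as np

TARGET_SR = 16000  # assumed target sample rate, not taken from the repository


def resample_linear(audio, orig_sr, target_sr=TARGET_SR):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Time stamps (in seconds) of the input and output samples.
    t_in = np.arange(len(audio)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(np.float32)


def save_as_npy(audio, out_path):
    """Store the waveform as a .npy file."""
    with open(out_path, "wb") as f:
        np.save(f, audio)


# Stand-in for a decoded audio file: one second of a 440 Hz tone at 44.1 kHz.
orig_sr = 44100
t = np.arange(orig_sr) / orig_sr
waveform = np.sin(2 * np.pi * 440.0 * t)

resampled = resample_linear(waveform, orig_sr)
out_path = os.path.join(tempfile.mkdtemp(), "example.npy")
save_as_npy(resampled, out_path)
loaded = np.load(out_path)
```

Because only the waveform is stored, any spectrogram parameters can be changed later without rerunning preprocessing.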

Training

cd training/

python -u main.py --data_path YOUR_DATA_PATH

Options

'--num_workers', type=int, default=0
'--dataset', type=str, default='mtat', choices=['mtat', 'msd', 'jamendo']
'--model_type', type=str, default='fcn',
                choices=['fcn', 'musicnn', 'crnn', 'sample', 'se', 'short', 'short_res', 'attention', 'hcnn']
'--n_epochs', type=int, default=200
'--batch_size', type=int, default=16
'--lr', type=float, default=1e-4
'--use_tensorboard', type=int, default=1
'--model_save_path', type=str, default='./../models'
'--model_load_path', type=str, default='.'
'--data_path', type=str, default='./data'
'--log_step', type=int, default=20
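
The entries above have the shape of `argparse` argument definitions. As a rough sketch of how they fit together, the parser below is reconstructed from that list only; the `build_parser` helper is hypothetical, and the real main.py may define additional options or handle them differently.

```python
import argparse

MODEL_TYPES = ['fcn', 'musicnn', 'crnn', 'sample', 'se',
               'short', 'short_res', 'attention', 'hcnn']


def build_parser():
    """Build a parser from the training options listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_workers', type=int, default=0)
    parser.add_argument('--dataset', type=str, default='mtat',
                        choices=['mtat', 'msd', 'jamendo'])
    parser.add_argument('--model_type', type=str, default='fcn',
                        choices=MODEL_TYPES)
    parser.add_argument('--n_epochs', type=int, default=200)
    parser.add_argument('--batch_size', type=int, default=16)
    parser.add_argument('--lr', type=float, default=1e-4)
    parser.add_argument('--use_tensorboard', type=int, default=1)
    parser.add_argument('--model_save_path', type=str, default='./../models')
    parser.add_argument('--model_load_path', type=str, default='.')
    parser.add_argument('--data_path', type=str, default='./data')
    parser.add_argument('--log_step', type=int, default=20)
    return parser


# Mirrors a command line such as:
#   python -u main.py --data_path YOUR_DATA_PATH --model_type short_res
config = build_parser().parse_args(
    ['--data_path', 'YOUR_DATA_PATH', '--model_type', 'short_res'])
print(config.model_type, config.batch_size, config.lr)
```

Options that are not passed keep their listed defaults, and a value outside `choices` (for example an unknown `--model_type`) makes `argparse` exit with an error.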

Evaluation

cd training/

python -u eval.py --data_path YOUR_DATA_PATH

Options

'--num_workers', type=int, default=0
'--dataset', type=str, default='mtat', choices=['mtat', 'msd', 'jamendo']
'--model_type', type=str, default='fcn',
                choices=['fcn', 'musicnn', 'crnn', 'sample', 'se', 'short', 'short_res', 'attention', 'hcnn']
'--batch_size', type=int, default=16
'--model_load_path', type=str, default='.'
'--data_path', type=str, default='./data'

Performance Comparison

Performances of SOTA models
