This repository provides a configurable deep convolutional neural network, written in Python 3.6, to isolate vocals from music. It is based on the acapellabot by madebyollin. The network was trained on the MedleyDB dataset.
The following libraries and packages need to be installed to use this project.
The settings used for execution are configurable either by exporting the appropriate environment variables, by directly setting the values in the `Config` class or, when using the grid search, by specifying the configuration in a `.yml` file. The configuration is applied using reflection.
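To illustrate the reflection mechanism, here is a minimal sketch of how environment variables can be mapped onto a `Config` class via `setattr`. This is an illustration, not the repository's actual implementation; the attribute names and defaults are taken from the options table at the end of this readme:

```python
import os


class Config:
    # Defaults; the attribute names match the environment
    # variables listed at the end of this readme.
    EPOCHS = "10"
    BATCH = "8"
    LEARN_PHASE = "True"

    def __init__(self):
        # Reflection: overwrite every upper-case attribute for
        # which an environment variable of the same name is set.
        for name in (a for a in dir(self) if a.isupper()):
            value = os.environ.get(name)
            if value is not None:
                setattr(self, name, value)


config = Config()
print(config.EPOCHS)  # prints "20" after `export EPOCHS=20`
```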
Some predefined configurations can be found in the `envs` directory. Source the environment file to load the configuration, e.g. for the `lps` environment run

```bash
source envs/lps
```
A list of available options can be found at the end of this readme.
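For example, to train for 20 epochs with the RI approach and a custom weight file, the corresponding variables can be exported before starting the training (the variable names are listed in the options table; the values here are purely illustrative):

```bash
export EPOCHS=20
export LEARN_PHASE=True
export WEIGHTS=weights/my_weights.h5
```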
To train the network, the corpus needs to contain the following files for each of the tracks:

- `_all`
- `_vocal`
- `_instrumental`

The `DATA` variable needs to point to the directory where the corpus is stored.
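Assuming `.wav` files and a track called `song1`, the corpus directory could look as follows (the names before the suffixes are illustrative):

```
../bot_data/
├── song1_all.wav
├── song1_vocal.wav
└── song1_instrumental.wav
```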
The training split can be specified by `SPLIT`. The `Config` class also contains an option to define the validation and test tracks directly. The number of epochs to train for can be set using `EPOCHS`. `WEIGHTS` points to the `.h5` or `.hdf5` file in which the weights should be stored. More configuration options can be found at the end of this readme.
After the configuration, the project can be executed by invoking

```bash
python3 vocal_isolation.py
```

The execution logs can be found in the `LOG_BASE` directory.
Two different learning approaches are available. The first one is similar to the original acapellabot and trains on log-power spectrograms (LPS). In this approach the phase information is lost and needs to be reconstructed using successive approximation. The second approach uses the real and imaginary parts of the complex spectrograms (RI), so the phase is learned as well. The learning approach can be chosen by setting `LEARN_PHASE` to `False` for LPS or to `True` for RI.
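To make the difference concrete, the following sketch shows how the two input representations can be derived from a waveform. It uses `librosa` for the STFT and is only an illustration of the idea, not the repository's actual preprocessing (the exact scaling of the LPS may differ):

```python
import librosa
import numpy as np

# Load a track and compute its complex spectrogram; the window
# size corresponds to the FFT option (default 1536).
audio, sample_rate = librosa.load("mytrack.wav", sr=None)
spectrogram = librosa.stft(audio, n_fft=1536)

# LPS: train on the log-power spectrogram only. The phase is
# discarded and has to be reconstructed after inference by
# successive approximation (see PHASE_ITERATIONS).
lps = np.log1p(np.abs(spectrogram))

# RI: train on the real and imaginary parts, stacked as two
# channels, so the network learns the phase as well.
ri = np.stack([spectrogram.real, spectrogram.imag], axis=-1)
```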
To train multiple configurations in one execution, `grid_search.py` can be used. It reads `.yml` files stored in the `grid-search-configs` folder and creates configurations for every possible combination of the values in the `.yml` file. When using the grid search, all output artifacts, including the weights, are written to one subfolder per configuration in the `LOG_BASE` directory.
The grid search can be executed by invoking

```bash
python3 grid_search.py [myconfig.yml]
```

If no `.yml` file is specified, the default `grid_search.yml`, containing all possible configurations, is used.
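A hypothetical grid search configuration could look like the following; the grid search would then create and train all four combinations of the listed values (the keys correspond to the options table, the exact file layout may differ from the shipped `grid_search.yml`):

```yaml
BATCH:
  - 8
  - 16
LEARN_PHASE:
  - True
  - False
```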
After the network is trained, the weights stored in `WEIGHTS` can be used to perform inference on a given track to isolate the vocals. As the inference is computationally expensive, it is not performed on the complete file but on smaller slices. The size of such a slice can be set by `INFERENCE_SLICE`. If your computer runs out of memory during inference, consider reducing the slice size.
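For example, to halve the default slice size of 3500 (see the options table) before running the inference:

```bash
export INFERENCE_SLICE=1750
```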
An inference can be executed by invoking

```bash
python3 vocal_isolation.py filetoinfer.wav
```
Different functionalities for analysis are available in the `Analysis` class, including the short-time objective intelligibility (STOI) measure, which uses `stoi.m` to calculate how intelligible the produced output is.

`analysis.py` can be invoked with the following parameters:
- `--analyse` or `-a`: the analysis method to be executed
- `--save` or `-s`: specifies whether the result should be saved
- `*args`: additional arguments depending on the analysis functionality

If the save option is specified, the results will be written to the directory given by `ANALYSIS_PATH`.
The following analysis functionalities are available:
Calculate the STOI value. If both arguments are given, the specified files will be used for the STOI calculation. Otherwise the clean file will be determined using the mix file.

```bash
python3 analysis.py -a stoi -s myfile.wav [cleanfile.wav]
```
Calculate the value distributions and their difference to the median for each percentile on the whole data set located at `DATA` and create a box plot. No additional arguments are required.

```bash
python3 analysis.py -a percentile -s
```
Calculate the mean squared error (MSE) between a processed and a clean vocal file. If no arguments are given, the MSE analysis calculates the mean squared error for each track in the validation and test set.

```bash
python3 analysis.py -a mse -s [myprocessedvocal.wav] [cleanvocal.wav]
```
Scale the volume between a ratio of 1/100 and 100, calculate the MSE for each ratio and plot the result.

```bash
python3 analysis.py -a volume -s myfile.wav
```
Calculate the value distributions of the dataset and plot them in a histogram. No additional arguments are required.

```bash
python3 analysis.py -a distribution -s
```
Variable | Description | Possible Values | Default |
---|---|---|---|
ANALYSIS_PATH | Path to store analysis results | valid directory | "./analysis" |
BATCH | Batch size used for training | number > 0 | "8" |
BATCH_GENERATOR | Batch generator used for sample creation | keras, default, track, random | "random" |
CHECKPOINTS | Checkpoints to be used by keras | tensorboard, weights, early_stopping | "tensorboard,weights" |
CHOPNAME | Slicing function for sample creation | tile, full, sliding_full, filtered, filtered_full, random, random_full, infere (only used for inference) | "tile" |
CHOPPARAMS | Parameter to configure slicing function | scale (sample size), step (for sliding), slices (for random), upper (only use low frequencies), filter (for filter*) | "{'scale': 128, 'step': 64, 'slices':256, 'upper':False, 'filter':'maximum'}" |
DATA | Path to training data | valid directory | "../bot_data" |
EARLY_STOPPING | Parameters for early stopping checkpoint | min_delta, patience | "{'min_delta': 0.001, 'patience': 3}" |
EPOCHS | Number of epochs to train for | number > 0 | "10" |
EPOCH_STEPS | Number of samples for the random generator | number > BATCH | "50000" |
FFT | Window size for STFT | number > 0 | "1536" |
INFERENCE_SLICE | Slice size for inference | number > 0 | "3500" |
INSTRUMENTAL | Flag to train on instrumentals | True, False | "False" |
LOAD | Flag to load previous weights | True, False | "False" |
LOG_BASE | Log directory | valid directory | "./logs" |
LOSS | The loss function to be used by keras | mean_squared_error, mean_absolute_error, mean_squared_log_error | "mean_squared_error" |
METRICS | Metrics to be used by keras | "mean_pred,max_pred" | "mean_pred,max_pred" |
MODEL | The model to be used for training | acapellabot, leaky_dropout | "leaky_dropout" |
MODEL_PARAMS | The parameters to configure the model (only leaky_dropout) | alpha1 and alpha2 for LeakyReLU, rate for dropout | "{'alpha1': 0.1,'alpha2': 0.01,'rate': 0.1}" |
NORMALIZER | The normalizer for data preparation | dummy (no normalization), percentile | "percentile" |
NORMALIZER_PARAMS | The parameters to configure the normalizer (only percentile) | percentile | "{'percentile': 99}" |
OPTIMIZER | The optimizer to be used by keras | adam, rmsprop | "adam" |
OPTIMIZER_PARAMS | The parameters to configure the optimizer | - | "" |
PHASE_ITERATIONS | The number of iterations for the phase reconstruction | number > 0 | "10" |
LEARN_PHASE | Flag to perform LPS or RI learning | True (RI), False (LPS) | "True" |
QUIT | Flag to quit after training | False, True | "True" |
SPLIT | Fraction of the data used for training (the rest is used for validation) | float between 0 and 1 | "0.9" |
START_EPOCH | Starting epoch | number >= 0 | "0" |
TENSORBOARD | Directory to store tensorboard output | valid directory | "./tensorboard" |
TENSORBOARD_INFO | Amount of information to be returned | full, default | "default" |
WEIGHTS | Path to weight file | .h5 or .hdf5 file | "weights/weights.h5" |