AugmentedNet is an automatic Roman numeral analysis neural network.
The network was developed by Néstor Nápoles López as part of his PhD research. It was first introduced in a co-authored ISMIR paper in 2021 and later described in the body of the dissertation.
It has been used to power the analysis features in at least the following projects:
The version documented in the PhD dissertation matches the `v1.9.1` release of this repository.

The older version of the model described in the ISMIR paper matches the `v1.0.0` release of this repository.

In general, the results of `v1.9.1` are better, and you are encouraged to use (and compare against) that version.
Nápoles López, Néstor. 2022. “Automatic Roman Numeral Analysis in Symbolic Music Representations.” PhD Thesis, McGill University. https://escholarship.mcgill.ca/concern/theses/qr46r6307.
@phdthesis{napoleslopez22automatic,
type = {{PhD} {Thesis}},
title = {Automatic {Roman} {Numeral} {Analysis} in {Symbolic} {Music} {Representations}},
url = {https://escholarship.mcgill.ca/concern/theses/qr46r6307},
school = {McGill University},
author = {Nápoles López, Néstor},
month = dec,
year = {2022}
}
N. Nápoles López, M. Gotham, and I. Fujinaga, “AugmentedNet: A Roman Numeral Analysis Network with Synthetic Training Examples and Additional Tonal Tasks,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference, 2021, pp. 404–411. https://doi.org/10.5281/zenodo.5624533
@inproceedings{napoleslopez21augmentednet,
author = {Nápoles López, Néstor and Gotham, Mark and Fujinaga, Ichiro},
title = {{AugmentedNet: A Roman Numeral Analysis Network
with Synthetic Training Examples and Additional
Tonal Tasks}},
booktitle = {{Proceedings of the 22nd International Society for
Music Information Retrieval Conference}},
year = 2021,
pages = {404-411},
publisher = {ISMIR},
address = {Online},
month = nov,
venue = {Online},
doi = {10.5281/zenodo.5624533},
url = {https://doi.org/10.5281/zenodo.5624533}
}
Clone, create a virtual environment, and get the Python dependencies:
```
git clone https://github.com/napulen/AugmentedNet.git
cd AugmentedNet
python3 -m venv .env
source .env/bin/activate
(.env) pip install -r requirements.txt
```
I have experienced that `pip` is sometimes incapable of installing specific package versions, depending on your environment. This `requirements.txt` was tested on a vanilla Ubuntu 20.04, both in native Linux and in Windows 10 WSL2. A Docker `tensorflow/tensorflow:2.5-gpu` image should also work.
Run the pre-trained model for inference on a MusicXML file:

```
python -m AugmentedNet.inference AugmentedNetv.hdf5 <input_file>.musicxml
```
Two files will be generated:

- `<input_file>_annotated.xml`: an annotated MusicXML file
- `<input_file>_annotated.csv`: a `csv` file with the predictions for every time step
Clone recursively (needed to collect the third-party datasets), create a virtual environment, and get the Python dependencies:
```
git clone --recursive https://github.com/napulen/AugmentedNet.git
cd AugmentedNet
python3 -m venv .env
source .env/bin/activate
(.env) pip install -r requirements.txt
```
To save you some time, we include the preprocessed `tsv` files of the real data, as well as the synthetic block-chord templates for texturization. These are available in the latest release.
```
wget https://github.com/napulen/AugmentedNet/releases/latest/download/dataset.zip
unzip dataset.zip
```
Now you are ready to train the network.
There are two ways to generate texturizations: at the tsv-level (legacy) and at the numpy-level (newer).
You can generate one texturization per file in the dataset with this script:

```
(.env) python -m AugmentedNet.dataset_tsv_generator --synthesize --texturize
```
Originally, this is how I trained `v1.0.0`: the network was trained with exactly twice the amount of real training data.
After `v1.5.0`, the tsv dataset only includes the templates (i.e., block chords). The texturization is done when encoding the numpy arrays for the neural network, right before training.
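To give an idea of what texturization means here (this is only a toy illustration, not the repository's actual texturization code), a block-chord template can be spread into a more realistic musical surface while keeping the same harmonic annotation:

```python
import numpy as np

# A block-chord "template": a C major triad (MIDI pitches) held for 8 time steps.
chord = [60, 64, 67]
steps = 8
block = np.zeros((steps, 128), dtype=np.float32)
block[:, chord] = 1.0  # every chord tone sounds on every time step

# A toy "texturization": arpeggiate the same chord, one tone per time step,
# so the network sees a different surface for the same harmonic label.
texturized = np.zeros_like(block)
for t in range(steps):
    texturized[t, chord[t % len(chord)]] = 1.0
```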
There are two options for texturization in this way: one texturization per file, or one texturization per transposition (see the training commands below). In the second approach, every time you transpose a synthetic example to a different key, you re-texturize it. The amount of "new" training examples seen by the network is much larger this way.
You can control those settings when training the network.
```
# No synthetic examples
(.env) python -m AugmentedNet.train <compulsory_args>

# Synthetic examples, one texturization per file
(.env) python -m AugmentedNet.train <compulsory_args> --syntheticDataStrategy concatenate

# Synthetic examples, one texturization per transposition
(.env) python -m AugmentedNet.train <compulsory_args> --syntheticDataStrategy concatenate --texturizeEachTransposition
```
See the next section for the `<compulsory_args>`.
At the moment, the code for generating the texturizations on their own is not particularly simple to use, if that is all you want to do. However, raise an issue or reach out, and I'll make my best effort to help with your use case.
If you want to train the network, the minimum call looks like this:

```
(.env) python -m AugmentedNet.train debug testexperiment
```
The code is integrated with mlflow. In the training script, `debug` and `testexperiment` refer to the experiment and run names passed down to mlflow. You can see additional CLI parameters by running `python -m AugmentedNet.train --help`.
After training the network, you will get a path to the trained `hdf5` model, which looks something like this:

```
The trained model is available in: .model_checkpoint/debug/testexperiment-220101T000000/81-6.000-0.78.hdf5
```
You can use that trained model for inference, using the same workflow shown above.
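If you would rather inspect the checkpoint programmatically than go through the inference script, something like the following should work. The path is the example one printed above; depending on the version, you may need `compile=False` (as here) or custom objects when loading:

```python
import tensorflow as tf

# Example checkpoint path taken from the training log above.
model = tf.keras.models.load_model(
    ".model_checkpoint/debug/testexperiment-220101T000000/81-6.000-0.78.hdf5",
    compile=False,  # skip the optimizer state; we only want to run inference
)
model.summary()
```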
The architecture is a CRNN (Convolutional Recurrent Neural Network) with an alternative representation of pitch spelling at the input.
More information about the neural network architecture can be found in the paper.
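As a rough sketch of what a CRNN of this kind looks like (not the actual architecture, hyperparameters, or task list from the paper; the input shape, layer sizes, and the two output heads below are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder dimensions: time steps per example and input features per step.
TIMESTEPS, FEATURES = 640, 70

inputs = tf.keras.Input(shape=(TIMESTEPS, FEATURES))

# Convolutional front end: local context around each time step.
x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)

# Recurrent back end: long-range context across the whole sequence.
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)

# Multitask heads: one classifier per tonal task (only two shown here).
key = layers.Dense(38, activation="softmax", name="key")(x)
degree = layers.Dense(22, activation="softmax", name="degree")(x)

model = tf.keras.Model(inputs, [key, degree])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```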
This repository is organized in the following way. The general organization of the code is summarized by the following diagram, where each of the blue rectangles roughly corresponds to a Python module.
The inputs of the network are pairs of (MusicXML, RomanText) files. The input pairs are converted into pandas DataFrame objects, stored as `.tsv` files. Later on, these are encoded into a representation that can be dispatched to the neural network.
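For reference, one way to parse such a pair is with music21, which handles both MusicXML and RomanText. This is only a sketch with placeholder file names, not necessarily how the repository's own parsing code is structured:

```python
from music21 import converter

# Placeholder file names for one (score, analysis) pair.
score = converter.parse("sonata.musicxml")  # the symbolic score
analysis = converter.parse("sonata.rntxt")  # the RomanText annotation

# Both come back as music21 streams whose events can be aligned by offset.
print(score.highestTime, analysis.highestTime)
```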
The module documentation is located here.
All the experiments presented in the paper were monitored using mlflow.
If you want to visualize the experiments with the mlflow ui:

```
pip install mlflow
mlflow ui
```

Run this from the terminal, making sure that `./mlruns/` is reachable from the current directory, then open `localhost:5000` in your browser.
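If you prefer querying the runs programmatically rather than through the UI, something along these lines should work; the experiment name `debug` matches the example training call above, and the tracking URI assumes the default local `./mlruns/` folder:

```python
import mlflow

# Point mlflow at the local ./mlruns/ directory created during training.
mlflow.set_tracking_uri("file:./mlruns")

# Returns None if the experiment name does not exist in ./mlruns/.
experiment = mlflow.get_experiment_by_name("debug")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
print(runs[["run_id", "status", "start_time"]])
```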
For extra convenience, I also uploaded the logs to TensorBoard.dev.
Here are the tables of the paper and a link to see the runs of each model in TensorBoard.dev.
These are the results for the four different configurations of the AugmentedNet.
Model | Key | Deg. | Qual. | Inv. | Root | RN |
---|---|---|---|---|---|---|
AugmentedNet6 | 82.7 | 64.4 | 76.6 | 77.4 | 82.5 | 43.3 |
AugmentedNet6+ | 83.0 | 65.1 | 77.5 | 78.6 | 83.0 | 44.6 |
AugmentedNet11 | 81.3 | 64.2 | 77.2 | 76.1 | 82.9 | 43.1 |
AugmentedNet11+ | 83.7 | 66.0 | 77.6 | 77.2 | 83.2 | 45.0 |
Visualize experiments in TensorBoard.dev!
`6` and `11` indicate the number of tasks in the multitask learning layout. `+` indicates the use of synthetic training data.
These are the results for the best AugmentedNet configuration (11+) against other models.
Test set | Training set | Model | Key | Degree | Quality | Inversion | Root | ComRN | RNconv | RNalt |
---|---|---|---|---|---|---|---|---|---|---|
Full test set | Full dataset | AugN | 82.9 | 67.0 | 79.7 | 78.8 | 83.0 | 65.6 | 46.4 | 51.5 |
WiR | Full dataset | AugN | 81.8 | 69.2 | 85.9 | 90.3 | 90.3 | 70.2 | 56.4 | 62.4 |
HaydnSun | Full dataset | AugN | 81.2 | 62.9 | 80.2 | 82.7 | 86.5 | 60.4 | 48.6 | 52.1 |
ABC | Full dataset | AugN | 83.6 | 65.6 | 78.0 | 76.9 | 78.9 | 62.6 | 44.5 | 48.4 |
TAVERN | Full dataset | AugN | 88.7 | 60.0 | 77.4 | 78.8 | 81.5 | 66.3 | 42.6 | 52.9 |
WTC | Full dataset | AugN | 77.2 | 69.7 | 75.0 | 74.4 | 82.7 | 61.7 | 46.2 | 47.9 |
WTCcrossval | BPS+WTC | AugN | 85.1(4.0) | 62.9(5.5) | 69.1(1.9) | 70.1(3.7) | 79.2(1.8) | 59.9(3.4) | 42.9(4.2) | 46.9(4.7) |
WTCcrossval | BPS+WTC | CS21 | 56.3(2.5) | - | - | - | - | - | 26.0(1.7) | - |
BPS | Full dataset | AugN | 85.0 | 73.4 | 79.0 | 73.4 | 84.4 | 68.3 | 45.4 | 49.3 |
BPS | All data | Mi20 | 82.9 | 68.3 | 76.6 | 72.0 | - | - | 42.8 | - |
BPS | BPS+WTC | AugN | 82.9 | 70.9 | 80.7 | 72.0 | 85.3 | 67.6 | 44.1 | 47.5 |
BPS | BPS+WTC | CS21 | 79.0 | - | - | - | - | - | 41.7 | - |
BPS | BPS | AugN | 83.0 | 71.2 | 80.3 | 71.1 | 84.1 | 68.5 | 44.0 | 47.4 |
BPS | BPS | Mi20 | 80.6 | 66.5 | 76.3 | 68.1 | - | - | 39.1 | - |
BPS | BPS | CS19 | 78.4 | 65.1 | 74.6 | 62.1 | - | - | - | - |
BPS | BPS | CS18 | 66.7 | 51.8 | 60.6 | 59.1 | - | - | 25.7 | - |