Code for the paper "Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription". This repository contains dataset wrappers, the Timbre-Trap framework, and the experiment scripts described below (and more).
A dedicated Hugging Face Space is available for performing inference with Timbre-Trap. Audio samples and visualizations for some experiments in the paper can be found here.
Clone the repository, install the requirements, then install `timbre-trap`:

```bash
git clone https://github.com/sony/timbre-trap
pip install -r timbre-trap/requirements.txt
pip install -e timbre-trap/
```
PyTorch dataset wrappers for several relevant datasets are available through the `datasets` subpackage.
Note that these are organized by data type, i.e. multi-instrument audio mixtures and single-instrument audio stems, with and without accompanying annotations.
Some datasets have wrappers for both mixtures and stems.
The wrappers also differentiate between frame-level pitch (MPE) and note-level (AMT) annotations, depending on what is available for each dataset.
The following is an example of how to use a dataset wrapper:

```python
from timbre_trap.datasets.MixedMultiPitch import URMP
from timbre_trap.utils import constants

# initialize the URMP mixture wrapper with its default parameters
urmp_data = URMP(base_dir=None,
                 splits=None,
                 sample_rate=22050,
                 cqt=None,
                 n_secs=None,
                 seed=0)

# iterate over the tracks in the dataset
for data in urmp_data:
    name = data[constants.KEY_TRACK]
    audio = data[constants.KEY_AUDIO]
    # obtain frame times and multi-pitch ground-truth for the track
    times, multipitch = urmp_data.get_ground_truth(name)
```
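Because the wrappers behave like standard PyTorch datasets, they can also be paired with a `DataLoader` for batched training. The following is only a rough sketch (not taken from the paper's training setup), assuming that fixed-length sampling via `n_secs` yields equally-sized audio excerpts which can be collated:

```python
from torch.utils.data import DataLoader
from timbre_trap.datasets.MixedMultiPitch import URMP
from timbre_trap.utils import constants

# sample fixed-length (3-second) excerpts so that audio can be batched
urmp_train = URMP(base_dir=None,
                  splits=None,
                  sample_rate=22050,
                  cqt=None,
                  n_secs=3,
                  seed=0)

# a standard PyTorch DataLoader over the wrapper
loader = DataLoader(urmp_train, batch_size=8, shuffle=True)

for batch in loader:
    audio = batch[constants.KEY_AUDIO]  # batched audio excerpts
```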
By default, the wrapper will look for the top-level dataset directory at `~/Desktop/Datasets/<DATASET>`.
However, this path can be specified using the `base_dir` keyword argument.
If the dataset does not exist, it will be downloaded automatically (except in cases where this is not possible).
The `splits` keyword can be used to partition the data based on pre-defined attributes or metadata.
If overridden, it should be given some subset of the output of the `available_splits()` function for each respective wrapper.
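For instance, assuming `available_splits()` can be called directly on the wrapper class and returns a list of split names (a sketch, not taken from the paper's experiments), the data can be partitioned as follows:

```python
from timbre_trap.datasets.MixedMultiPitch import URMP

# inspect the pre-defined splits for this wrapper
all_splits = URMP.available_splits()

# hold out the last split for validation (an arbitrary, illustrative choice)
urmp_train = URMP(base_dir=None,
                  splits=all_splits[:-1],
                  sample_rate=22050,
                  cqt=None,
                  n_secs=None,
                  seed=0)

urmp_val = URMP(base_dir=None,
                splits=all_splits[-1:],
                sample_rate=22050,
                cqt=None,
                n_secs=None,
                seed=0)
```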
A CQT module wrapper must be provided to the `cqt` argument in order to convert the ground-truth to targets during training:

```python
from timbre_trap.framework import CQT

# initialize a CQT module matching the dataset's sampling rate
cqt_module = CQT(n_octaves=9,
                 bins_per_octave=60,
                 sample_rate=22050,
                 secs_per_block=3)

# attach the CQT module to the dataset wrapper
urmp_data.cqt = cqt_module

for data in urmp_data:
    name = data[constants.KEY_TRACK]
    audio = data[constants.KEY_AUDIO]
    ground_truth = data[constants.KEY_GROUND_TRUTH]
```
The 2D autoencoder model used in the Timbre-Trap framework can be initialized with the following:

```python
from timbre_trap.framework import TimbreTrap

model = TimbreTrap(sample_rate=22050,
                   n_octaves=9,
                   bins_per_octave=60,
                   secs_per_block=3,
                   latent_size=128,
                   model_complexity=2,
                   skip_connections=False)
```
Under the hood this will also initialize a CQT module, accessible with `model.cqt`, which provides several useful utilities.
These include conversion between real-valued and complex-valued coefficients, synthesis of audio from coefficients, and acquisition of times for each frame of coefficients.
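The method names in the sketch below are hypothetical placeholders for these utilities (the actual names live in the `CQT` class of `timbre_trap.framework`), and `model` is assumed to have been initialized as above; the sketch only conveys the intended workflow:

```python
import torch

# one second of silence at the model's sampling rate
# (assumed input shape: batch x channel x samples)
audio = torch.zeros(1, 1, 22050)

# compute complex-valued CQT coefficients for the audio
coefficients = model.cqt(audio)

# the following calls are hypothetical stand-ins for the utilities described
# above; consult timbre_trap.framework for the actual method names
real_coefficients = model.cqt.to_real(coefficients)  # real <-> complex conversion
resynthesized = model.cqt.decode(coefficients)       # synthesize audio from coefficients
frame_times = model.cqt.get_times(audio)             # times for each frame of coefficients
```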
Weights for the base model from our paper are available for download within our dedicated Hugging Face Space. The weights can be loaded after initializing the appropriate model:

```python
model.load_state_dict(torch.load(weights_path))
```
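For instance, a minimal loading sketch, assuming the downloaded file is a plain state dictionary and inference will run on CPU (`weights_path` below is a placeholder for wherever the file was saved):

```python
import torch
from timbre_trap.framework import TimbreTrap

weights_path = 'path/to/timbre-trap-weights.pt'  # placeholder path

model = TimbreTrap(sample_rate=22050,
                   n_octaves=9,
                   bins_per_octave=60,
                   secs_per_block=3,
                   latent_size=128,
                   model_complexity=2,
                   skip_connections=False)

# load the weights onto CPU and switch to evaluation mode
model.load_state_dict(torch.load(weights_path, map_location='cpu'))
model.eval()
```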
The script `experiments/train.py` exemplifies the training process for the framework.
It should be run with `experiments/` as the current working directory.
Relevant hyperparameters for experimentation are defined at the top of the script.
The training script also utilizes an evaluation loop defined in `experiments/evaluate.py`, which can be invoked independently:
```python
from evaluate import evaluate

results = evaluate(model=model,
                   eval_set=val_set,
                   multipliers=[1, 1, 1])
```
Helper functions are available for performing both reconstruction and transcription at inference time:

```python
transcription = model.transcribe(audio)
reconstruction = model.reconstruct(audio)
```
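As a usage sketch, an audio file can be loaded and resampled to the model's sampling rate before calling the helpers. This example uses torchaudio purely for illustration (not necessarily a dependency of this repository), `audio_path` is a placeholder, and the input shape (batch x channel x samples) is an assumption; `model` is assumed to have been initialized as above:

```python
import torch
import torchaudio

audio_path = 'path/to/audio.wav'  # placeholder path

# load audio and collapse to mono
audio, fs = torchaudio.load(audio_path)
audio = audio.mean(dim=0, keepdim=True)

# resample to the rate the model was initialized with (22050 Hz here)
audio = torchaudio.functional.resample(audio, fs, 22050)

# add a batch dimension (assumed shape: batch x channel x samples)
audio = audio.unsqueeze(0)

with torch.no_grad():
    transcription = model.transcribe(audio)
    reconstruction = model.reconstruct(audio)
```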
The script `experiments/comparison.py` exemplifies using the Timbre-Trap framework for inference, and compares results to those computed for the baseline models Deep-Salience and Basic-Pitch.
Additional examples of inference are provided in the scripts `experiments/latents.py` and `experiments/sonify.py`.
These scripts visualize a reduced latent space and synthesize audio from reconstructed spectral and transcription coefficients for Bach10, respectively.
Execution of `experiments/train.py` will generate the following under `<root_dir>` (defined at the top of the script):
- `n/` - folder (beginning at `n = 1`)¹ containing sacred experiment files:
  - `config.json` - parameter values used for the experiment
  - `cout.txt` - contains any text printed to console
  - `metrics.json` - validation and evaluation results for the best model checkpoint
  - `run.json` - system and experiment information
- `models/` - folder containing saved model weights at each checkpoint, as well as an events file (for each execution) readable by tensorboard
- `_sources/` - folder containing copies of scripts at the time(s) of execution

¹An additional folder (`n += 1`) containing similar files is created for each execution with the same experiment name `<EX_NAME>`.
Losses and various validation metrics can be analyzed in real-time by running:
```bash
tensorboard --logdir=<root_dir>/models --port=<port>
```

Here we assume the current working directory contains `<root_dir>`, and `<port>` is an integer corresponding to an available port (`port = 6006` if unspecified).
After running the above command, navigate to `http://localhost:<port>` in a web browser to view any reported training or validation observations within the tensorboard interface.
```bibtex
@inproceedings{cwitkowitz2024timbre,
  title     = {{Timbre-Trap}: A Low-Resource Framework for Instrument-Agnostic Music Transcription},
  author    = {Cwitkowitz, Frank and Cheuk, Kin Wai and Choi, Woosung and Mart{\'\i}nez-Ram{\'\i}rez, Marco A and Toyama, Keisuke and Liao, Wei-Hsiang and Mitsufuji, Yuki},
  year      = 2024,
  booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}
}
```