Clarification on dataset format(s)

minzwon / sota-music-tagging-models

Clarification on dataset format(s) #26

Closed JLenzy closed 1 year ago

JLenzy commented 1 year ago

I have two goals:

1. Run inference with some of the pre-trained models on my own dataset.
2. Train a new model on my own dataset(s).

In both cases I am having trouble with the dataset format: the scripts seem to require a very specific dataset format that is not really detailed in the README. If you could clarify how our datasets should be formatted, it would be greatly appreciated! Thanks in advance.

minzwon commented 1 year ago

Hi,

You can preprocess the audio into .npy format using the preprocessing script: https://github.com/minzwon/sota-music-tagging-models/blob/master/preprocessing/mtat_read.py

In this experiment, I preprocessed the audio into .npy format in advance because downsampling is time-consuming. But if your audio is already at the target sampling rate, you can also consider loading audio on the fly using the librosa or essentia libraries.
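For readers who want to go the .npy route, here is a minimal sketch of such a preprocessing step. It is not the repository's mtat_read.py. The helper names (`npy_path_for`, `convert_folder`), the list of file extensions, and the mirrored folder layout are all assumptions made for illustration, and the 16 kHz rate is the one mentioned later in this thread. It relies on `librosa.load`, which resamples to the requested rate, and on `numpy.save`.

```python
from pathlib import Path

TARGET_SR = 16000  # assumption: 16 kHz, the rate mentioned later in this thread
AUDIO_EXTS = {".mp3", ".wav", ".flac"}  # assumption: extensions to convert


def npy_path_for(audio_path, audio_root, npy_root):
    """Mirror audio_root's layout under npy_root, swapping the suffix for .npy."""
    rel = Path(audio_path).relative_to(audio_root)
    return Path(npy_root) / rel.with_suffix(".npy")


def convert_folder(audio_root, npy_root, sr=TARGET_SR):
    """Resample every audio file under audio_root and save it as a raw-waveform .npy."""
    # Imported lazily so the path helper above works without these packages.
    import librosa
    import numpy as np

    for audio_path in sorted(Path(audio_root).rglob("*")):
        if audio_path.suffix.lower() not in AUDIO_EXTS:
            continue
        out_path = npy_path_for(audio_path, audio_root, npy_root)
        if out_path.exists():
            continue  # already converted on a previous run
        out_path.parent.mkdir(parents=True, exist_ok=True)
        # librosa.load resamples to `sr` and returns a mono float32 waveform.
        waveform, _ = librosa.load(str(audio_path), sr=sr)
        np.save(str(out_path), waveform)
```

Calling `convert_folder("path/to/audio", "path/to/npy")` once up front pays the resampling cost a single time, at the price of extra disk space for the .npy copies. The repository's data loaders may expect a specific folder layout and file naming, so check them before relying on this one.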

expectopatronum commented 1 year ago

Hi, was a modified version of mtat_read used for MTG-Jamendo as well? I ask because AudioFolder seems to expect .npy files. Or should they be included in MTG-Jamendo? (I don't remember when I downloaded the dataset, but my copy does not contain any NumPy files.)

EDIT: I just saw the command python scripts/baseline/get_npy.py run 'your_path_to_spectrogram_npy' in the MTG-Jamendo description. Is this the correct preprocessing?

Best regards,
Verena

minzwon commented 1 year ago

Hi Verena,

There are two ways of handling it.

1. Preprocess your MTG-Jamendo audio files into .npy files, like I did for the MagnaTagATune dataset. This is the easiest way to use this repository, but you need extra space to store them.
2. Read the audio directly during training. In this case, you need to modify AudioFolder.get_npy to read audio files.
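The second option could look roughly like the sketch below. It does not reproduce the repository's AudioFolder.get_npy, whose exact signature is not shown in this thread. `pick_start` and `load_audio_chunk` are hypothetical helpers, and the random fixed-length crop is an assumption about what a training loader would want. The 16 kHz default is the rate mentioned in this thread.

```python
import random


def pick_start(num_samples, input_length, rng=random):
    """Return a random start index so that a chunk of input_length fits in the clip."""
    if num_samples < input_length:
        raise ValueError("clip is shorter than the requested input length")
    # randint is inclusive on both ends, so the last valid start is allowed.
    return rng.randint(0, num_samples - input_length)


def load_audio_chunk(audio_path, input_length, sr=16000):
    """Read an audio file and return a random fixed-length raw-audio chunk."""
    import librosa  # lazy import: only needed when audio is actually read

    # librosa.load returns a mono float32 waveform, resampled to `sr` if needed.
    waveform, _ = librosa.load(audio_path, sr=sr)
    start = pick_start(len(waveform), input_length)
    return waveform[start:start + input_length]
```

If the files are not already at the target rate, `librosa.load` resamples on every read, which is the slow step the .npy preprocessing was meant to avoid. Resampling the audio files once and then reading them directly keeps the benefit of both approaches.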

When I worked on this project, I needed to downsample the audio to a 16 kHz sampling rate, so I decided to store it in .npy format. But this format is inefficient, so I recommend using audio files instead of .npy.

The preprocessing script you mentioned is different from the one used in this repository. This repo calculates mel spectrograms on the fly, so its .npy files contain raw audio, not mel spectrograms.
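A quick way to tell which kind of .npy file you have is to look at its shape. The sketch below assumes that raw mono audio is stored as a 1-D array and that a spectrogram is stored as a 2-D array. That is a common convention, but it is an assumption, not something this thread confirms. `classify_shape` and `describe_npy` are hypothetical helper names.

```python
def classify_shape(shape):
    """Guess what an array holds from its shape (assumed convention, see above)."""
    if len(shape) == 1:
        return "raw audio"        # 1-D: a sequence of samples
    if len(shape) == 2:
        return "spectrogram-like"  # 2-D: e.g. frequency bins x time frames
    return "unknown"


def describe_npy(path):
    """Print the shape and dtype of a .npy file along with the guess."""
    import numpy as np  # lazy import: only needed when a file is inspected

    # mmap_mode="r" avoids reading the whole file into memory.
    arr = np.load(path, mmap_mode="r")
    print(path, arr.shape, arr.dtype, classify_shape(arr.shape))
```

Files that come out as "spectrogram-like" would not match what this repository expects, since it computes the mel spectrograms itself from raw audio.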