minzwon / sota-music-tagging-models

MIT License

Clarification on dataset format(s) #26

Closed by JLenzy 1 year ago

JLenzy commented 1 year ago

I have two goals:

In both cases I am having trouble with the dataset format: the scripts seem to require a very specific format that is not really detailed in the README. Any clarification on how our datasets should be formatted would be greatly appreciated! Thanks in advance.

minzwon commented 1 year ago

Hi,

You can preprocess the audio into .npy format using the preprocessing script: https://github.com/minzwon/sota-music-tagging-models/blob/master/preprocessing/mtat_read.py

In this experiment, I preprocessed the audio into .npy format in advance because downsampling is time-consuming. But if your audio is already at the target sampling rate, you can also consider loading audio on-the-fly using the librosa or essentia libraries.
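As a rough illustration of the "resample once, store as .npy" idea described above, here is a minimal sketch. It is not the repository's mtat_read.py; the function name audio_to_npy, its arguments, and the injectable load_fn parameter are hypothetical and exist only to keep the example self-contained. It assumes librosa is installed when no custom loader is passed, and that 16 kHz is the target rate.

```python
from pathlib import Path

import numpy as np


def audio_to_npy(src_path, dst_path, target_sr=16000, load_fn=None):
    """Decode one audio file, resample it, and save the raw waveform as .npy.

    Hypothetical helper for illustration; not part of the repository.
    """
    if load_fn is None:
        # Imported lazily so this module can be imported without librosa.
        import librosa

        load_fn = librosa.load

    # librosa.load resamples to `sr` and downmixes to mono by default.
    waveform, _ = load_fn(src_path, sr=target_sr)

    # Make sure the output has a .npy suffix and its folder exists.
    dst_path = Path(dst_path).with_suffix(".npy")
    dst_path.parent.mkdir(parents=True, exist_ok=True)

    # Store the raw waveform (not a spectrogram) as float32.
    np.save(dst_path, np.asarray(waveform, dtype=np.float32))
    return dst_path
```

Calling audio_to_npy("song.mp3", "npy/song.npy") once per track pays the resampling cost a single time; later training runs then only need np.load.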

expectopatronum commented 1 year ago

Hi, was a modified version of mtat_read used for MTG-Jamendo as well? Because AudioFolder seems to expect .npy files. Or should they be included in MTG-Jamendo? (I don't remember when I downloaded the dataset, but my copy does not contain any numpy files)

EDIT: I just saw python scripts/baseline/get_npy.py run 'your_path_to_spectrogram_npy' in the MTG-Jamendo description, is this the correct preprocessing?

Best regards,
Verena

minzwon commented 1 year ago

Hi Verena,

There are two ways of handling it.

When I worked on this project, I needed to downsample the audio to a 16 kHz sampling rate, so I decided to store it in .npy format. But this format is inefficient, so I recommend using audio files instead of .npy.

The preprocessing script you mentioned is different from the one in this repository. This repo calculates mel spectrograms on-the-fly. The .npy files contain raw audio, not mel spectrograms.
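To make the distinction concrete, here is a minimal sketch of how a data loader might read such a raw-audio .npy file and cut a fixed-length chunk from it, leaving the mel spectrogram to be computed later (for example inside the model). The function name random_crop_from_npy and the input_length argument are hypothetical, and this is not the repository's actual loader.

```python
import numpy as np


def random_crop_from_npy(npy_path, input_length, rng=None):
    """Return a random fixed-length chunk of raw audio samples from a .npy file.

    Hypothetical helper for illustration; not part of the repository.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Memory-map the file so only the requested chunk is read from disk.
    audio = np.load(npy_path, mmap_mode="r")
    if len(audio) < input_length:
        raise ValueError("audio is shorter than the requested input length")

    # rng.integers excludes the upper bound, hence the + 1.
    start = int(rng.integers(0, len(audio) - input_length + 1))

    # Copy the slice into a regular in-memory float32 array.
    return np.array(audio[start:start + input_length], dtype=np.float32)
```

The returned array is a 1-D waveform of shape (input_length,), which is what a model that computes its own mel spectrogram would expect as input.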