pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Music Information Retrieval Evaluation Exchange (MIREX) datasets #31

Open 0x00b1 opened 6 years ago

0x00b1 commented 6 years ago

Each year the Music Information Retrieval Evaluation Exchange (MIREX) sponsors a number of noteworthy competitions for problems like chord, key change, and tempo estimation. Many of the competitions have longstanding (sometimes 10+ years) training and test sets. If torchaudio provided these datasets in a standard format (similar to VCTK or something like PASCAL from torchvision), PyTorch would become an invaluable toolkit for researchers working on these types of problems.

faroit commented 6 years ago

👍

I am currently converting a few of my dataset loaders to PyTorch and will open pull requests for some of them one at a time. However, many datasets (especially those in MIREX) cannot be downloaded and used automatically. I would therefore suggest adding only those that are released under a CC (or similar) license and have a sustainable file location (ideally Zenodo).

On my to-do list are:

cyrta commented 6 years ago

I am going to do

and in the near future

There are also lists:

dhpollack commented 6 years ago

It would be great if there were more datasets. However, not all of the dataset owners want to be included in a popular library like PyTorch. For example, I talked to the person running openslr.org and he said the hosting costs were already pretty high. He preferred that the larger datasets were not included in the library, so we decided to add the yesno toy dataset.

soumith commented 6 years ago

@dhpollack if the owner of OpenSLR is okay with the datasets being redistributed, we can host them in the PyTorch S3 bucket for the PyTorch data loaders.

dhpollack commented 6 years ago

@soumith, I'll get back in touch with him and see what he thinks about it.

faroit commented 6 years ago

@dhpollack @soumith this is a very good point. We should be careful then... Many of the sets, like LibriSpeech, are under a CC license, which means they can also easily be uploaded to Zenodo, which supports direct downloading.

cyrta commented 6 years ago

I strongly encourage using Zenodo. The datasets can then be versioned. Shall we do it?

faroit commented 6 years ago

> I strongly encourage using Zenodo. The datasets can then be versioned. Shall we do it?

I think this should be done by the authors of OpenSLR. Also, Zenodo has a 50 GB limit per dataset, so that may be an issue here.

faroit commented 6 years ago

@0x00b1 I suggest you rename this issue as we are out of the MIR scope now ;-)

danpovey commented 6 years ago

Person who hosts LibriSpeech here. I am definitely OK with you guys redistributing it. (That is the preferred solution, because PyTorch is widely used and a data loader would drive up our hosting costs.)

For all these speech datasets, bear in mind that they won't necessarily share a common format that would let you make them all look identical after loading. E.g. some may have transcripts per segment and some per file; some may have overlapping speech marked; the AMI dataset has various microphone choices. They also have different transcription conventions, and the conventional scoring methods are usually different too. I recommend starting with TED-LIUM v2 because it's simpler than AMI, and not as large as LibriSpeech.
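To make the format differences above concrete, here is a minimal sketch of one possible common record type. Everything in it (`Segment`, `Utterance`, `from_file_transcript`, `has_overlap`) is a hypothetical name chosen for illustration and is not part of torchaudio. It assumes that a per-file transcript can be treated as a single segment spanning the whole file, and that the microphone choice is optional metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One transcribed stretch of audio, in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


@dataclass
class Utterance:
    """Hypothetical common record for differently structured speech datasets."""
    audio_path: str
    segments: List[Segment]
    microphone: Optional[str] = None  # only some datasets offer a choice


def from_file_transcript(audio_path, text, duration):
    """Wrap a per-file transcript as a single segment covering the file."""
    return Utterance(audio_path, [Segment(0.0, duration, text)])


def has_overlap(utt):
    """Return True if any two segments of the utterance overlap in time."""
    ordered = sorted(utt.segments, key=lambda s: s.start)
    # After sorting by start time, checking neighbours is sufficient.
    return any(a.end > b.start for a, b in zip(ordered, ordered[1:]))
```

A record like this only unifies the container. Transcription conventions and scoring methods would still differ per dataset and would need separate handling.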

danpovey commented 6 years ago

Also be sure to put prominent warnings on all of the data loaders about how large the datasets are and how much space they will require. Otherwise, I expect you will get a lot of noobs trying to download LibriSpeech onto their MacBook Air.
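One way such a warning could be implemented is a free-space check before the download starts. The sketch below uses only the standard library. `check_free_space` and its parameters are hypothetical names, not a torchaudio API, and the required size is assumed to be supplied by the loader's author rather than looked up automatically.

```python
import shutil
import warnings


def check_free_space(target_dir, required_bytes, free_bytes=None):
    """Warn and return False if target_dir lacks room for required_bytes.

    free_bytes may be passed explicitly (useful for testing); by default
    it is read from the filesystem with shutil.disk_usage.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(target_dir).free
    if free_bytes < required_bytes:
        warnings.warn(
            f"This dataset needs about {required_bytes / 1e9:.1f} GB, but only "
            f"{free_bytes / 1e9:.1f} GB is free in {target_dir!r}.",
            stacklevel=2,
        )
        return False
    return True
```

A loader could call this before downloading and stop, or ask for confirmation, when it returns `False`.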

faroit commented 6 years ago

> Person who hosts LibriSpeech here. I am definitely OK with you guys redistributing it. (That is the preferred solution, because PyTorch is widely used and a data loader would drive up our hosting costs.)

@danpovey that's good news. I think this should be coordinated to give a consistent experience, so maybe one person should do all the uploads. It would also be great to create a Zenodo community to put all OpenSLR datasets under the same umbrella.

danpovey commented 6 years ago

This is the first I've heard of Zenodo, but it does look like a good option for the future. We'd have to upload our data there first, which would involve a certain amount of human effort, but it would reduce our hosting costs.

vincentqb commented 4 years ago

@jacobkahn for Libri-Light

jacobkahn commented 4 years ago

Regarding Libri-Light, redistribution is fine. We can conceivably host a compatible version in the same place where we host the existing dataset (S3) if that's needed.

cc @eugene-kharitonov @Molugan for PyTorch loaders for Libri-Light.