0x00b1 opened this issue 6 years ago
I am currently converting a few of my dataset loaders for PyTorch and will successively open pull requests for some of them. However, for many datasets (especially those in MIREX) an automatic way of downloading and using them is not possible. I would therefore suggest adding only those that are released under a CC (or similar) license and have a sustainable file location (ideally Zenodo).
On my to-do list are:
I am going to do:
and in the near future:
There are also these lists:
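Since most of these loaders would sit behind the same download-and-extract step, here is a minimal sketch of such a helper, assuming the archive lives at a stable URL such as a Zenodo record (the record URL, archive name, and hash below are placeholders, not a real dataset):

```python
import os
import tarfile

import torch


def download_and_extract(url, root, archive_name, hash_prefix=None):
    """Download an archive from a stable URL (e.g. a Zenodo record) and extract it into `root`."""
    os.makedirs(root, exist_ok=True)
    archive_path = os.path.join(root, archive_name)
    if not os.path.exists(archive_path):
        # download_url_to_file can optionally verify a SHA-256 prefix of the file
        torch.hub.download_url_to_file(url, archive_path, hash_prefix=hash_prefix)
    with tarfile.open(archive_path) as tar:
        tar.extractall(root)


# hypothetical Zenodo record URL; replace with the real dataset archive
download_and_extract(
    "https://zenodo.org/record/<record-id>/files/dataset.tar.gz",
    root="./data",
    archive_name="dataset.tar.gz",
)
```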
It would be great if there were more datasets. However, not all dataset owners want their data included in a popular library like PyTorch. For example, I talked to the person running openslr.org, and he said the hosting costs were already pretty high. He preferred that the larger datasets not be included in the library, so we decided to add the yesno toy dataset instead.
@dhpollack if the owner of openslr is okay with the datasets being redistributed, we can host them on the PyTorch S3 bucket for the PyTorch DataLoader
@soumith, I'll get back in touch with him and see what he thinks about it.
@dhpollack @soumith this is a very good point. We should be careful then. Many of the sets, like LibriSpeech, are under a CC license, which means they can also easily be uploaded to Zenodo, as it supports direct downloading.
I strongly encourage using Zenodo. Datasets there can also be versioned. Shall we do it?
I think this should be done by the authors of OpenSLR. Also, Zenodo has a 50 GB limit per dataset, so maybe that is an issue here.
@0x00b1 I suggest you rename this issue as we are out of the MIR scope now ;-)
Person who hosts Librispeech here. I am definitely OK with you guys redistributing it. (That is the preferred solution: PyTorch is widely used, and a data loader pointing at our servers would drive up our hosting costs.)
For all these speech datasets, bear in mind that they won't necessarily have a common format, so you may not be able to make them all look identical after loading. E.g. some may have transcripts per segment and some per file; some may have overlapping speech marked; the AMI dataset has various microphone choices. They also use different transcription conventions, and the conventional scoring methods usually differ as well. I recommend starting with TED-LIUM v2 because it's simpler than AMI and not as large as LibriSpeech.
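One way to picture the problem: any shared record type would have to leave most annotation fields optional. The dataclass below is purely illustrative (the field names are made up, not an existing torchaudio interface):

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class Utterance:
    """One possible common record a loader could map each corpus onto.

    All field names here are illustrative, not an existing torchaudio convention.
    """

    waveform: torch.Tensor                  # (channels, samples)
    sample_rate: int
    transcript: Optional[str] = None        # per-segment or per-file, depending on the corpus
    speaker_id: Optional[str] = None
    microphone: Optional[str] = None        # e.g. AMI offers several microphone conditions
    overlapping_speech: bool = False        # some corpora mark overlapping speech explicitly
```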
Also be sure to put prominent warnings on all of the data loaders about how large they are and how much space they will require. Otherwise I expect you will get a lot of noobs trying to download Librispeech onto their MacBook Air.
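A cheap way to back up such a warning is to compare the advertised dataset size against the free space on the target filesystem before downloading. A minimal sketch (the 60 GB figure is a placeholder, not the corpus's actual size):

```python
import os
import shutil
import warnings


def warn_if_low_disk_space(root: str, required_bytes: int) -> None:
    """Warn if the filesystem holding `root` has less free space than the dataset needs."""
    os.makedirs(root, exist_ok=True)
    free = shutil.disk_usage(root).free
    if free < required_bytes:
        warnings.warn(
            f"This dataset needs roughly {required_bytes / 1e9:.0f} GB, "
            f"but only {free / 1e9:.0f} GB are free under {root!r}."
        )


# placeholder size; check the corpus page for the real figure before downloading
warn_if_low_disk_space("./data", required_bytes=60 * 10**9)
```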
@danpovey that's good news. I think this should be coordinated so there is a consistent experience, so maybe one person does all the uploads. It would also be great to create a Zenodo community to put all the OpenSLR datasets under the same umbrella.
This is the first I've heard of Zenodo, but it does look like a good option for the future. We'd have to upload our data there first, which would involve a certain amount of human effort, but it would reduce our hosting costs.
@jacobkahn for libri-light
Regarding Libri-Light, redistribution is fine. We can conceivably host a compatible version in the same place where we host the existing dataset (S3) if that's needed.
cc @eugene-kharitonov @Molugan for PyTorch loaders for Libri-Light.
Each year the Music Information Retrieval Evaluation Exchange (MIREX) sponsors a number of noteworthy competitions for problems like chord, key change, and tempo estimation. Many of the competitions have longstanding (sometimes 10+ years) training and test sets. If torchaudio provided these datasets in a standard format (similar to VCTK or something like PASCAL from torchvision), PyTorch would become an invaluable toolkit for researchers working on these types of problems.
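To make that concrete, such a loader could follow the same `torch.utils.data.Dataset` pattern torchaudio already uses for VCTK. The sketch below assumes a hypothetical layout of audio clips plus a `labels.csv`; real MIREX sets each have their own structure and licensing:

```python
import csv
import os
from typing import Tuple

import torch
import torchaudio
from torch.utils.data import Dataset


class MirexLikeDataset(Dataset):
    """Illustrative loader for a folder of audio clips with a CSV of labels.

    The class name, file layout, and labels.csv format are assumptions for this
    sketch; real MIREX training sets each have their own layout.
    """

    def __init__(self, root: str):
        self.root = root
        with open(os.path.join(root, "labels.csv"), newline="") as f:
            # each row: <relative audio path>, <label> (e.g. a key or tempo value)
            self.items = [(row[0], row[1]) for row in csv.reader(f)]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int, str]:
        path, label = self.items[n]
        waveform, sample_rate = torchaudio.load(os.path.join(self.root, path))
        return waveform, sample_rate, label
```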