mir-dataset-loaders / mirdata

Python library for working with Music Information Retrieval datasets
https://mirdata.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Da-Tacos Dataset #204

Closed rabitt closed 2 years ago

rabitt commented 4 years ago

We currently don't have any datasets for cover song identification, and this is a new, large collection.

https://github.com/MTG/da-tacos

cc @alastair

(Chosen by popular vote on Twitter!)

alastair commented 4 years ago

CC @furkanyesiler and @philtgun !

lostanlen commented 4 years ago

Note that Da-TACOS has no audio, only precomputed features. I'm not sure how that will be encoded in the table of all datasets? #203

furkanyesiler commented 4 years ago

Hi @rabitt and @lostanlen !

First of all, many thanks for all the effort you and your teammates put into this project. As Da-TACOS authors and people at MTG, we're looking forward to being a part of mirdata!

How do you think we should proceed with integrating Da-TACOS here? Currently, all the files are stored in a Google Drive folder and can be downloaded with the script provided in our repo. In my opinion, to make things more stable, we should finish publishing all the files on Zenodo as the first step.

In terms of making things easy, we can change the way we store the files. All our features are stored in .h5 files accessible with the deepdish library, and each file contains both the features and the label annotations. As far as I can see from the index files, your preferred way is to store features and annotations in separate files. Should we make that change before publishing the dataset on Zenodo as well? Also, instead of using .h5 files, should we consider another format like .npy?
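For illustration, the .npy route mentioned above could look like the sketch below, with one file carrying both features and labels (the key names and shapes are made up):

```python
import os
import tempfile

import numpy as np

# Hypothetical per-track payload: features and label annotations together
track = {
    "hpcp": np.zeros((100, 12)),  # chroma-like feature matrix
    "label": "W_163930",          # made-up work/clique id
}

path = os.path.join(tempfile.mkdtemp(), "track.npy")
np.save(path, track, allow_pickle=True)

# np.save wraps the dict in a 0-d object array; .item() unwraps it
loaded = np.load(path, allow_pickle=True).item()
```

One caveat: dict payloads need `allow_pickle=True` to round-trip, which the .h5 route avoids.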

As the ISMIR deadline is approaching, I would also like to ask about a tentative schedule for finishing this. Please let us know your ideas, and we can discuss the easiest way to take the next steps.

rabitt commented 4 years ago

> Note that Da-TACOS has no audio, only precomputed features. I'm not sure how that will be encoded in the table of all datasets? #203

We can add a new emoji key to indicate that the audio is available as "features only". By the way, this is the case for the MSD as well (#81 ). Open to suggestions for the emoji! My first ideas are one of:

rabitt commented 4 years ago

@furkanyesiler -

> How do you think we should proceed with integrating Da-TACOS here? Currently, all the files are stored in a Google Drive folder and can be downloaded with the script provided in our repo. In my opinion, to make things more stable, we should finish publishing all the files on Zenodo as the first step.

Having the files on Zenodo will definitely make things easier on the mirdata side, and if that's your eventual plan, I'd say it's best to add the mirdata integration once that's complete.

> In terms of making things easy, we can change the way we store the files. All our features are stored in .h5 files accessible with the deepdish library, and each file contains both the features and the label annotations. As far as I can see from the index files, your preferred way is to store features and annotations in separate files. Should we make that change before publishing the dataset on Zenodo as well? Also, instead of using .h5 files, should we consider another format like .npy?

It's actually no problem to have the features and annotations in the same file! In general, we don't ask people to change their dataset's structure for mirdata - we write our code around each dataset's structure. The PR is temporarily closed, and a bit out of date now, but take a look at #149 for an example that has just one file per "track".
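That single-file-per-track layout can be wrapped in a lazily loading track object. The following is only a sketch of the pattern, not mirdata's actual API; the attribute names and the injected loader are hypothetical:

```python
from functools import cached_property


class Track:
    """Sketch: features and annotations live in one per-track file."""

    def __init__(self, track_id, data_home, loader):
        self.track_id = track_id
        self.features_path = f"{data_home}/{track_id}.h5"
        self._loader = loader  # a dict-returning reader, e.g. deepdish.io.load

    @cached_property
    def _data(self):
        # one file read serves every property below
        return self._loader(self.features_path)

    @property
    def hpcp(self):
        return self._data["hpcp"]

    @property
    def work_id(self):
        return self._data["label"]
```

Because `_data` is a `cached_property`, the file is read once no matter how many feature or annotation properties are accessed.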

> As the ISMIR deadline is approaching, I would also like to ask about a tentative schedule for finishing this. Please let us know your ideas, and we can discuss the easiest way to take the next steps.

I think it would be great if the loader were available for people to use for their ISMIR papers. However, it's totally up to you and your workload. Note that we're also happy to help with the PR to speed things up.

furkanyesiler commented 4 years ago

Thanks for your reply @rabitt!

> Having the files on Zenodo will definitely make things easier on the mirdata side, and if that's your eventual plan, I'd say it's best to add the mirdata integration once that's complete.

Yes, this is definitely something we are planning to do. We'll speed things up for this.

> It's actually no problem to have the features and annotations in the same file! In general, we don't ask people to change their dataset's structure for mirdata - we write our code around each dataset's structure. The PR is temporarily closed, and a bit out of date now, but take a look at #149 for an example that has just one file per "track".

Not changing the data structure would definitely make things easier for us. The only issue there is using the deepdish library to read the files. Do you think we can add that library to setup.py as an extra requirement?
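For what it's worth, setuptools supports exactly that via `extras_require`. A sketch (the extra's name `dacos` is made up, and this is not the actual mirdata setup.py) so that plain installs stay lean while `pip install mirdata[dacos]` pulls in deepdish:

```python
# setup.py fragment (sketch): deepdish as an optional extra
from setuptools import setup

setup(
    name="mirdata",
    # ... other arguments ...
    extras_require={
        # only installed with: pip install mirdata[dacos]
        "dacos": ["deepdish"],
    },
)
```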

> I think it would be great if the loader were available for people to use for their ISMIR papers. However, it's totally up to you and your workload. Note that we're also happy to help with the PR to speed things up.

I'll talk to the Da-TACOS team to see what an approximate deadline could be for us. People who want to use the dataset for their ISMIR papers can still download it with the script we provide in our repo, so I guess that shouldn't be a problem. I'll keep you updated as soon as I know more!

rabitt commented 4 years ago

> Not changing the data structure would definitely make things easier for us. The only issue there is using the deepdish library to read the files. Do you think we can add that library to setup.py as an extra requirement?

Yes, we can add deepdish as a dependency - I'm sure this will not be the only dataset requiring h5 support. Quick question - do you have any opinions on h5py vs deepdish?
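To make the comparison concrete, here is a toy round trip in plain h5py (the dataset and attribute names are made up). deepdish's `dd.io.load` returns a whole file as one nested dict in a single call, whereas h5py has you address each dataset and attribute explicitly:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "track.h5")

# Write a toy per-track file: one feature dataset plus a label attribute
with h5py.File(path, "w") as f:
    f.create_dataset("hpcp", data=np.zeros((4, 12)))
    f.attrs["label"] = "W_163930"  # made-up work id

# Read it back; [()] pulls a dataset fully into memory as an ndarray
with h5py.File(path, "r") as f:
    hpcp = f["hpcp"][()]
    label = f.attrs["label"]
```

The trade-off: h5py is the lower-level, more widely maintained dependency; deepdish is more convenient for the nested-dict layout these files already use.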

> I'll talk to the Da-TACOS team to see what an approximate deadline could be for us. People who want to use the dataset for their ISMIR papers can still download it with the script we provide in our repo, so I guess that shouldn't be a problem. I'll keep you updated as soon as I know more!

Sounds good!

magdalenafuentes commented 4 years ago

Hey @furkanyesiler, looking forward to having Da-Tacos integrated in mirdata! The MTG has wonderful datasets and we're really happy to have them in the library!

+1 for moving the dataset to Zenodo, and +1 to @rabitt's question on deepdish vs h5py. Also, let me call your attention to our new_loader PR template (see the contributing guidelines), which should be useful when you open your PR. Thanks again for contributing!

rabitt commented 4 years ago

@furkanyesiler wanted to check in and see if you guys have time to start thinking about this again. Let me know if there's anything we can do to help!

furkanyesiler commented 4 years ago

Hi @rabitt. Many apologies for the delay, it's been crazy with many deadlines. I'll post an update here as soon as possible. Thanks for the reminder!

PRamoneda commented 3 years ago

The new loader is in #434!!!!