mir-dataset-loaders / mirdata

Python library for working with Music Information Retrieval datasets
https://mirdata.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

OpenMic2018 #253

Closed rabitt closed 2 years ago

rabitt commented 4 years ago

https://zenodo.org/record/1432913#.XqAkvdMzZ24 https://github.com/cosmir/openmic-2018

bmcfee commented 2 years ago

This is embarrassing :grin: mind if I take a crack at it?

magdalenafuentes commented 2 years ago

Please do!

bmcfee commented 2 years ago

Working on this in between ISMIR sessions today. One question: what do the benevolent maintainers think about using pandas instead of raw CSV munging? I don't want to add unnecessary dependency bloat, but the openmic annotations and metadata are stored in a few (large) CSV files that would be much easier to process and align if loaded as dataframes. (Plus I have general misgivings about writing one's own CSV parser.)
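For the sake of discussion, here's a minimal sketch of the kind of alignment I mean. The miniature in-memory CSVs below are stand-ins for the real openmic files, and the column names are assumptions for illustration, not the actual schema:

```python
import io
import pandas as pd

# Hypothetical miniature versions of the openmic CSVs, just to illustrate
# the approach; the real files are much larger but have the same shape.
labels_csv = io.StringIO(
    "sample_key,instrument,relevance\n"
    "000046_3840,clarinet,0.171\n"
    "000046_3840,flute,0.0\n"
)
meta_csv = io.StringIO(
    "sample_key,track_title,artist_name\n"
    "000046_3840,Yosemite,Nicky Cook\n"
)

labels = pd.read_csv(labels_csv)
meta = pd.read_csv(meta_csv, index_col="sample_key")

# A single index-aligned join replaces a hand-written alignment loop
df = meta.join(labels.set_index("sample_key"), how="left")
```

The point is that the join/alignment logic comes for free once everything is a dataframe.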

bmcfee commented 2 years ago

... actually I just realized that there's already an implicit dependency on pandas through jams.

bmcfee commented 2 years ago

Ok, I have something approaching a prototype working. Before I go much further with it, I want to solicit some feedback.

When you dig into it, openmic is a fairly complex dataset, and I'm not sure how much of it makes sense to expose through the mirdata API. So far, what I have is able to expose raw audio, pre-computed vggish features (similar to how datacos does it), and a big whack of metadata largely imported from FMA. (I haven't added the properties for these yet.)

The labels are pulled from the "aggregated-labels.csv" file, which encodes only the mean ratings for observed instrument/clip pairs. Each rating is a continuous (but generally quantized) value between 0 (label absent) and 1 (label present); a nan indicates that the pair was never observed.
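Concretely, the sparse (sample, instrument, rating) triples can be pivoted into a wide table where unobserved pairs come out as nan. A sketch with made-up rows (column names assumed for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the shape of aggregated-labels.csv: one row per
# observed (sample, instrument) pair, mean rating in [0, 1].
agg = pd.DataFrame({
    "sample_key": ["000046_3840", "000046_3840", "000135_483840"],
    "instrument": ["clarinet", "flute", "voice"],
    "relevance": [0.171, 0.0, 1.0],
})

# One column per instrument; pairs never observed become NaN,
# matching the nan-means-unobserved convention described above.
wide = agg.pivot(index="sample_key", columns="instrument", values="relevance")
```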

We also have the number of raters available, but I think it's not worth exposing that. Most of the time it's 3; there are a handful of 1s, and a long tail that reaches up into the hundreds. (I believe these were our honeypot examples.)

All of this, together with the pre-generated partition, gives a track metadata structure like the following:


In [112]: df['000046_3840']
Out[112]: 
{
    'track_id': 46,
    'album_id': 4.0,
    'album_title': 'Niris',
    'album_url': 'http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/',
    'artist_id': 4,
    'artist_name': 'Nicky Cook',
    'artist_url': 'http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/',
    'artist_website': nan,
    'license_image_file': 'http://i.creativecommons.org/l/by-nc-nd/3.0/88x31.png',
    'license_image_file_large': 'http://fma-files.s3.amazonaws.com/resources/img/licenses/by-nc-nd.png',
    'license_parent_id': nan,
    'license_title': 'Attribution-NonCommercial-NoDerivatives (aka Music Sharing) 3.0 International',
    'license_url': 'http://creativecommons.org/licenses/by-nc-nd/3.0/',
    'tags': '[]',
    'track_bit_rate': 256000.0,
    'track_comments': 0,
    'track_composer': nan,
    'track_copyright_c': nan,
    'track_copyright_p': nan,
    'track_date_created': '11/26/2008 01:49:53 AM',
    'track_date_recorded': '1/01/2008',
    'track_disc_number': 1,
    'track_duration': '01:44',
    'track_explicit': nan,
    'track_explicit_notes': nan,
    'track_favorites': 0,
    'track_file': 'music/WFMU/Nicky_Cook__Chris_Andrews/Niris/Nicky_Cook__Chris_Andrews_-_08_-_Yosemite.mp3',
    'track_genres': [
        {'genre_id': '76', 'genre_title': 'Experimental Pop', 'genre_url': 'http://freemusicarchive.org/genre/Experimental_Pop/'},
        {'genre_id': '103', 'genre_title': 'Singer-Songwriter', 'genre_url': 'http://freemusicarchive.org/genre/Singer-Songwriter/'}
    ],
    'track_image_file': 'https://freemusicarchive.org/file/images/albums/Chris_and_Nicky_Andrews_-_Niris_-_2009113012134556.jpg',
    'track_information': nan,
    'track_instrumental': 0,
    'track_interest': 252,
    'track_language_code': 'en',
    'track_listens': 171,
    'track_lyricist': nan,
    'track_number': 8,
    'track_publisher': nan,
    'track_title': 'Yosemite',
    'track_url': 'http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Yosemite',
    'start_time': 3.84,
    'split': 'train',
    'accordion': nan,
    'banjo': nan,
    'bass': nan,
    'cello': nan,
    'clarinet': 0.1710499999999999,
    'cymbals': nan,
    'drums': nan,
    'flute': 0.0,
    'guitar': nan,
    'mallet_percussion': nan,
    'mandolin': nan,
    'organ': nan,
    'piano': nan,
    'saxophone': nan,
    'synthesizer': nan,
    'trombone': nan,
    'trumpet': 0.0,
    'ukulele': nan,
    'violin': nan,
    'voice': nan
}

Note that the openmic labels are the last twenty fields, preceded by the split identifier (train or test), and then the FMA metadata.

So far so good. The question-mark for me is what to do with the disaggregated rating data. This would be super useful to have for crowdsourcing research (less so for the instrument recognition task, unless you're really digging into personalized models), so I'd like to include it if possible. It looks something like the following:

In [64]: df = pd.read_csv('openmic-2018-individual-responses.csv', index_col=0)

In [65]: df.head(10)
Out[65]: 
              worker_id  worker_trust channel instrument  response
sample_key                                                        
000046_3840    b1281110        0.8146    a163      flute       0.0
000046_3840    67a2a2bf        0.9091    a163    trumpet       0.0
000046_3840    9c5f715c        0.9167    a163    trumpet       0.0
000046_3840    dddd907a        0.7273    125a    trumpet       0.0
000046_3840    892f3c66        0.7692    a163   clarinet       0.0
000046_3840    af7e56ee        0.8000    a163   clarinet       1.0
000046_3840    91cdd5c5        0.7708    125a      flute       0.0
000046_3840    68de85ac        0.7207    a163      flute       0.0
000046_3840    2edc8001        0.7692    125a   clarinet       0.0
000135_483840  d975913e        1.0000    125a      voice       1.0
...

The issue is that it's rather large and unwieldy, so it might not be easy to pack into the per-track metadata object. (It also might not be appropriate to do so, e.g. if you wanted to query by annotator instead of by track.) Since different tracks will have different numbers of ratings, we can't just tack on additional columns. We might be able to do some kind of pivot/aggregate to pack all the ratings for a particular track into a nested object (kind of like we do with the genre tags above), but it feels clumsy to me. Is there precedent elsewhere in mirdata for this sort of thing? Or do the benevolent maintainers have thoughts on how to proceed?
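To make the "clumsy" option concrete, the pivot/aggregate I have in mind would look roughly like this: group by `sample_key` and collect each group's rows as a list of record dicts, analogous to how the genre tags are nested above. Column names follow the sample output; everything else is a sketch:

```python
import pandas as pd

# Hypothetical slice of the individual-responses table
responses = pd.DataFrame({
    "sample_key": ["000046_3840", "000046_3840", "000135_483840"],
    "worker_id": ["b1281110", "67a2a2bf", "d975913e"],
    "worker_trust": [0.8146, 0.9091, 1.0],
    "instrument": ["flute", "trumpet", "voice"],
    "response": [0.0, 0.0, 1.0],
})

# Pack all ratings for each track into a list of record dicts, which could
# then be attached to the per-track metadata like the genre tags are.
nested = (
    responses.groupby("sample_key")
    .apply(lambda g: g.drop(columns="sample_key").to_dict("records"))
    .to_dict()
)
```

It works, but it loses the ability to query cheaply by annotator, which is part of my hesitation.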

bmcfee commented 2 years ago

One more question: what do you think about raising an exception (or at least a warning) if a user asks for a random split instead of using the predefined one? I only ask because getting the openmic split right was a real pain, and I don't want people to mistakenly use a bad split.
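Something like the following, where the function name and signature are entirely hypothetical (not mirdata's actual API), just to show the shape of the check:

```python
import warnings

def get_split(track_metadata, method="predefined"):
    """Hypothetical helper: return the train/test split for openmic tracks.

    Warns loudly if the caller asks for anything other than the
    predefined split shipped with the dataset.
    """
    if method != "predefined":
        warnings.warn(
            "openmic ships with a carefully constructed train/test split; "
            "ad hoc random splits risk leakage between partitions.",
            UserWarning,
        )
    # Fall through either way: use the 'split' field already in the metadata
    return {key: meta["split"] for key, meta in track_metadata.items()}
```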