mir-dataset-loaders / mirdata

Python library for working with Music Information Retrieval datasets
https://mirdata.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

list of available tags #223

Open rabitt opened 4 years ago

rabitt commented 4 years ago

Brought up by @lostanlen in an offline discussion:

For datasets which provide "tag-like" data, there is no list of all possible values for those tags. We should add, within their respective modules as module-level functions:

lostanlen commented 4 years ago

Bonus points if this can be done without loading any metadata, but simply by parsing track_ids

rabitt commented 4 years ago

For most namespaces, I'd actually vote for this to be hardcoded in the module. If it were built by parsing, e.g., Track objects over .track_ids(), it would require instantiating the whole dataset.

lostanlen commented 4 years ago

@rabitt what do you mean, hardcoded? as constants? directly in the code of the module or in a JSON file? if the former, i'm a bit concerned about import time. If the latter, that intertwines with the long-term plan of #153 of hosting large files with LFS. It's also JAMS namespace territory

also, what if the list is wrong?

lostanlen commented 4 years ago

@rabitt i'm thinking about this a little bit more. My opinion is that this feature is overkill for small datasets like TinySOL (16 instruments), MSDB (8 instruments) or GTZAN Genre (10 genres). But i think it would be crucial for larger ones such as OrchideaSOL (multilabel, hundreds of techniques). If you want, i can rewrite #174 (closed OrchideaSOL PR) with a parse_track_id function so that the list of playing techniques comes down to

def unique(track_id_info):
    # sorted unique values of one parsed field (needs `import numpy as np`)
    return np.unique([parse_track_id(track_id)[track_id_info] for track_id in track_ids()])

Then, we could have a sorted list of instruments (orchideasol.unique("instrument")), techniques (orchideasol.unique("technique")), dynamics (orchideasol.unique("dynamics")), etc.

Does this sound like a plan?
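A minimal sketch of the parse_track_id idea, for illustration only. The track id format ("instrument-technique-pitch-dynamics"), the FIELDS tuple, and the sample ids are assumptions made up for this example; the real OrchideaSOL naming scheme may differ. sorted(set(...)) stands in for np.unique here to avoid a NumPy dependency; both yield sorted unique values.

```python
# Hypothetical track ids; the real dataset's naming scheme may differ.
TRACK_IDS = [
    "Vn-ord-C4-mf",
    "Vn-pizz-C4-ff",
    "Fl-ord-A4-pp",
    "Fl-flatt-A4-mf",
]

# Assumed order of the fields encoded in each track id.
FIELDS = ("instrument", "technique", "pitch", "dynamics")


def track_ids():
    """Return all track ids (stand-in for the keys of the module's index)."""
    return list(TRACK_IDS)


def parse_track_id(track_id):
    """Split a track id into a dict mapping field name to value."""
    return dict(zip(FIELDS, track_id.split("-")))


def unique(track_id_info):
    """Return the sorted unique values of one field across all track ids."""
    values = {parse_track_id(track_id)[track_id_info] for track_id in track_ids()}
    return sorted(values)


print(unique("instrument"))
print(unique("technique"))
```

Only strings are parsed here, so no audio, metadata, or Track objects need to be loaded.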

rabitt commented 4 years ago

@rabitt what do you mean, hardcoded? as constants? directly in the code of the module or in a JSON file?

I mean a very simple hardcoded function that is verified by tests. For example:

# in mirdata.gtzan_genre
def genres():
    return [
        'blues',
        'classical',
        'country',
        'disco',
        'hip-hop',
        'jazz',
        'metal',
        'pop',
        'reggae',
        'rock'
    ]

OR (when possible)

def genres():
    return jams.schema.values('tag_gtzan')

and in the tests:

def test_track():
    ...
    dataset = gtzan_genre.load()
    genres = gtzan_genre.genres()
    assert len(genres) == 10
    # assuming load() returns a dict mapping track_id to Track
    for track in dataset.values():
        assert track.genre in genres

also, what if the list is wrong?

then we shouldn't merge it :) This is where the tests are important. A hardcoded list can be wrong just as a json file on disk can be wrong.
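As a rough sketch of how such a test could catch a wrong list, the check can run in both directions: every tag seen on a track must be listed, and every listed tag should appear on at least one track. The Track namedtuple and check_genres helper below are hypothetical stand-ins, not part of mirdata.

```python
from collections import namedtuple

# Hypothetical stand-in for a Track class; only the fields used here are modeled.
Track = namedtuple("Track", ["track_id", "genre"])


def genres():
    """Hardcoded list of genres, as proposed above."""
    return [
        "blues", "classical", "country", "disco", "hip-hop",
        "jazz", "metal", "pop", "reggae", "rock",
    ]


def check_genres(tracks, allowed):
    """Return (unknown, unused): tags missing from the list, and listed tags never seen."""
    observed = {track.genre for track in tracks}
    unknown = observed - set(allowed)
    unused = set(allowed) - observed
    return unknown, unused


# One made-up track per genre, so the list and the data agree.
tracks = [Track(f"{genre}.00000", genre) for genre in genres()]
unknown, unused = check_genres(tracks, genres())
assert not unknown and not unused
```

If either set is non-empty, the hardcoded list and the data disagree and the test fails before the change is merged.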

@rabitt i'm thinking about this a little bit more. My opinion is that this feature is overkill for small datasets like TinySOL (16 instruments), MSDB (8 instruments) or GTZAN Genre (10 genres). But i think it would be crucial for larger ones such as OrchideaSOL (multilabel, hundreds of techniques).

I actually think it is crucial even for the smaller datasets: imagine using a tag in a classification setup. It's important to have a complete list of tags somewhere, even if that list is small.
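To illustrate the classification case, here is a small sketch that builds a stable label encoding from the complete tag list. GENRE_TO_INDEX and one_hot are hypothetical names for this example. The point is that the encoding depends only on the full list, not on which tags happen to occur in a particular split.

```python
# Complete, ordered list of tags (the GTZAN genres quoted above).
GENRES = [
    "blues", "classical", "country", "disco", "hip-hop",
    "jazz", "metal", "pop", "reggae", "rock",
]

# Map each tag to a fixed integer class index.
GENRE_TO_INDEX = {genre: index for index, genre in enumerate(GENRES)}


def one_hot(genre):
    """Return a one-hot list for a genre; raises KeyError for an unlisted tag."""
    vector = [0] * len(GENRES)
    vector[GENRE_TO_INDEX[genre]] = 1
    return vector


print(GENRE_TO_INDEX["jazz"])
print(one_hot("blues"))
```

Without the complete list, a training subset that lacks one genre would produce a shorter encoding with shifted indices.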

If you want, i can rewrite #174 (closed OrchideaSOL PR) with a parse_track_id function so that the list of playing techniques comes down to

def unique(track_id_info):
    return np.unique([parse_track_id(track_id)[track_id_info] for track_id in track_ids()])

Then, we could have a sorted list of instruments (orchideasol.unique("instrument")), techniques (orchideasol.unique("technique")), dynamics (orchideasol.unique("dynamics")), etc.

Does this sound like a plan?

Sure, that can be a solution for OrchideaSOL. I'm fine with this dataset.tags() (or whatever the attribute is called) method being implemented however makes the most sense for the dataset. The one thing I want to avoid is these functions needing to load the entire dataset of Track objects in order to return a list that will never change.