rabitt opened this issue 4 years ago
Bonus points if this can be done without loading any metadata, but simply by parsing `track_ids`
For most namespaces, I'd actually vote for this to be hardcoded in the module. If it's created by parsing e.g. Track objects rather than `.track_ids()`, it would require instantiating the whole dataset.
@rabitt what do you mean, hardcoded? As constants? Directly in the code of the module or in a JSON file? If the former, I'm a bit concerned about import time. If the latter, that intertwines with the long-term plan of #153 of hosting large files with LFS. It's also JAMS namespace territory.
also, what if the list is wrong?
@rabitt I'm thinking about this a little bit more. My opinion is that this feature is overkill for small datasets like TinySOL (16 instruments), MSDB (8 instruments), or GTZAN Genre (10 genres). But I think it would be crucial for larger ones such as OrchideaSOL (multilabel, hundreds of techniques). If you want, I can rewrite #174 (closed OrchideaSOL PR) with a `parse_track_id` function so that the list of playing techniques comes down to:
```python
import numpy as np

def unique(track_id_info):
    return np.unique([parse_track_id(track_id)[track_id_info] for track_id in track_ids()])
```
Then, we could have a sorted list of instruments (`orchideasol.unique("instrument")`), techniques (`orchideasol.unique("technique")`), dynamics (`orchideasol.unique("dynamics")`), etc.
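For concreteness, here is a minimal sketch of what such a `parse_track_id` could look like; the hyphen-delimited id format and the field names below are assumptions for illustration, not OrchideaSOL's actual naming scheme:

```python
# Hypothetical sketch: assumes track ids are hyphen-delimited as
# instrument-technique-pitch-dynamics, e.g. "Hn-ord-C4-ff".
def parse_track_id(track_id):
    instrument, technique, pitch, dynamics = track_id.split("-")
    return {
        "instrument": instrument,
        "technique": technique,
        "pitch": pitch,
        "dynamics": dynamics,
    }
```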
Does this sound like a plan?
> @rabitt what do you mean, hardcoded? As constants? Directly in the code of the module or in a JSON file?
I mean a very simple hardcoded function that is verified by tests. For example:
```python
# in mirdata.gtzan_genre
def genres():
    return [
        'blues',
        'classical',
        'country',
        'disco',
        'hip-hop',
        'jazz',
        'metal',
        'pop',
        'reggae',
        'rock',
    ]
```
OR (when possible)
```python
import jams

def genres():
    return jams.schema.values('tag_gtzan')
```
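For reference, `jams.schema.values` returns the list of allowed values for an enumerated JAMS namespace, so the call above should yield the ten GTZAN labels (the output shown here is illustrative):

```python
>>> import jams
>>> jams.schema.values('tag_gtzan')
['blues', 'classical', 'country', 'disco', 'hip-hop', 'jazz', 'metal', 'pop', 'reggae', 'rock']
```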
and in the tests:
```python
def test_track():
    ...
    dataset = gtzan_genre.load()
    genres = gtzan_genre.genres()
    assert len(genres) == 10
    for track in dataset.values():
        assert track.genre in genres
```
> also, what if the list is wrong?
Then we shouldn't merge it :) This is where the tests are important. A hardcoded list can be wrong just as a JSON file on disk can be wrong.
> @rabitt I'm thinking about this a little bit more. My opinion is that this feature is overkill for small datasets like TinySOL (16 instruments), MSDB (8 instruments), or GTZAN Genre (10 genres). But I think it would be crucial for larger ones such as OrchideaSOL (multilabel, hundreds of techniques).
I actually think it is crucial even for the smaller datasets - imagine using a tag in a classification setup - it's important to have a complete list of tags somewhere, even if that list is small.
> If you want, I can rewrite #174 (closed OrchideaSOL PR) with a `parse_track_id` function so that the list of playing techniques comes down to `np.unique([parse_track_id(track_id)[track_id_info] for track_id in track_ids()])`. Then, we could have a sorted list of instruments (`orchideasol.unique("instrument")`), techniques (`orchideasol.unique("technique")`), dynamics (`orchideasol.unique("dynamics")`), etc.
>
> Does this sound like a plan?
Sure, that can be a solution for OrchideaSOL - I'm fine with this `dataset.tags()` (or whatever the attribute is called) method being implemented however makes the most sense for the dataset. The one thing I want to avoid is having these functions load the entire dataset of Track objects in order to return a list that will never change.
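To make that concrete, here is a rough sketch of the cheap pattern, reusing the `parse_track_id` idea from above; the lazy memoization and the `"technique"` field are illustrative assumptions, not a settled API:

```python
_TAGS = None

def tags():
    # Derive the vocabulary from track ids alone: track_ids() only reads
    # the index, so no Track objects (and no metadata) are instantiated.
    global _TAGS
    if _TAGS is None:
        _TAGS = sorted({parse_track_id(tid)["technique"] for tid in track_ids()})
    return _TAGS
```

Computing the list lazily also sidesteps the import-time concern raised earlier: nothing runs until `tags()` is first called.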
Brought up by @lostanlen in an offline discussion:
For datasets which provide "tag-like" data, there is no list of all possible values for those tags. We should add these, within their respective modules, as module-level functions:

- `gtzan_genre.get_genres()`
- `tinysol.get_instruments()`
- ... others?
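A hypothetical usage sketch of the proposed functions (the names follow the list above; the import style is an assumption):

```python
from mirdata import gtzan_genre, tinysol

genres = gtzan_genre.get_genres()        # e.g. ['blues', 'classical', ...]
instruments = tinysol.get_instruments()  # the 16 TinySOL instrument labels
```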