Open PRamoneda opened 4 years ago
This is similar to the case of the million song dataset, which also doesn't have audio. #81
I think it could still have a Track class without audio, assuming there are track-specific annotations. However, if the dataset is e.g. a relational database, we could have a loader without a Track class, and think about a different type of structure for that.
Ping @andreasjansson who was thinking about this when we first designed this repository!
@rabitt, @PRamoneda, @andreasjansson there are two issues here.
First issue: the Million Song Dataset, and maybe also AcousticBrainz and other datasets for which we distribute the features and not the audio. This can probably be solved with minimal changes in mirdata.
Second issue: this one is a bit more complex. In lastfm1k we can still use the Track paradigm, because we have songs that have been listened to by N users at N times. However, in lastfm360k and other datasets (https://www.upf.edu/web/mtg/semantic-similarity) we can't use the Track object anymore, and we need to think about developing other objects. The question is whether we should develop this in parallel to mirdata or whether we can integrate it: maybe we can create two categories, mirdata and mirdata-tuples, and put all these cases there. For the tests it will be a bit of a mess... I suggested the name mirdata-tuples because all these non-audio datasets seem to return tuples:
user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays
user-mboxsha1 \t gender (m|f|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
Then maybe we need to provide methods to filter a particular artist or user from the tuples (which is what music recommendation needs).
An example of a lastfm data loader is here: https://github.com/benfred/implicit/blob/master/examples/lastfm.py
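To make the tuple idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `PlayCount`, `parse_play_counts`, `filter_by_user`, and `filter_by_artist` are hypothetical names and not part of mirdata, the column order follows the first tuple layout quoted above, and the sample rows are made up.

```python
from typing import Iterable, Iterator, List, NamedTuple


class PlayCount(NamedTuple):
    """One row: user-mboxsha1, musicbrainz-artist-id, artist-name, plays."""

    user_id: str
    artist_mbid: str
    artist_name: str
    plays: int


def parse_play_counts(lines: Iterable[str]) -> Iterator[PlayCount]:
    """Yield one PlayCount per tab-separated line with exactly four fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not match the assumed layout
        user_id, artist_mbid, artist_name, plays = fields
        yield PlayCount(user_id, artist_mbid, artist_name, int(plays))


def filter_by_user(rows: Iterable[PlayCount], user_id: str) -> List[PlayCount]:
    """Keep only the rows that belong to one user."""
    return [row for row in rows if row.user_id == user_id]


def filter_by_artist(rows: Iterable[PlayCount], artist_mbid: str) -> List[PlayCount]:
    """Keep only the rows that refer to one artist id."""
    return [row for row in rows if row.artist_mbid == artist_mbid]


# Made-up rows, for illustration only (hypothetical ids and names).
SAMPLE_LINES = [
    "user_a\tmbid-1\tArtist One\t12",
    "user_a\tmbid-2\tArtist Two\t3",
    "user_b\tmbid-1\tArtist One\t7",
]

rows = list(parse_play_counts(SAMPLE_LINES))
print(filter_by_user(rows, "user_a"))
print(filter_by_artist(rows, "mbid-1"))
```

Whether something like this should live in a loader module or in a separate tuple-style object is exactly the open design question; the sketch only shows that the per-user and per-artist filters are cheap to express once rows are typed.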
Hey @nkundiushuti!
I also like the idea of including the recommendation datasets, and I'm open to including an alternative object in the loaders, so they have either a track or a tuple. Though I'm not familiar enough with these datasets to know if a tuple object would be enough to include most of them.
I quickly asked @andrebola who works in recommendation and he mentioned this package which seems relevant, maybe we can take inspiration from it?
Hi all! It's probably very difficult to support all the possible alternatives and corner cases. I think it is a good idea to follow an existing library for recommendations.
I just wanted to add to what Marius commented that:
I think lastfm-1k provides the timestamps, so it can't be simplified to <user, track, N times>. These are some datasets that can be used for reference: LFM-1b, #nowplaying-rs, #nowplaying and 30music.
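A small sketch of why the timestamps matter, under stated assumptions: `ListeningEvent` and `to_play_counts` are hypothetical names, the fields are a simplification rather than the actual lastfm-1k schema, and the events are made up. Collapsing events into play counts keeps how often a user played a track but drops when and in what order.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, NamedTuple


class ListeningEvent(NamedTuple):
    """One listen: who played which track, and when (simplified fields)."""

    user_id: str
    timestamp: datetime
    track_name: str


def to_play_counts(events: Iterable[ListeningEvent]) -> Counter:
    """Collapse events into (user, track) -> number of plays, dropping timestamps."""
    return Counter((event.user_id, event.track_name) for event in events)


# Made-up events, for illustration only.
EVENTS = [
    ListeningEvent("user_a", datetime(2009, 5, 1, 9, 0), "Track X"),
    ListeningEvent("user_a", datetime(2009, 5, 1, 9, 4), "Track Y"),
    ListeningEvent("user_a", datetime(2009, 5, 2, 18, 30), "Track X"),
]

counts = to_play_counts(EVENTS)
print(counts)

# The listening order is only recoverable from the events, not from the counts.
session_order = [event.track_name for event in sorted(EVENTS, key=lambda e: e.timestamp)]
print(session_order)
```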
I think this starts to hit on some new core issues we haven't yet faced in mirdata! I've just opened #277 and #276 to start thinking about how to cleanly handle these cases
A similar dataset that fits this scenario is the Music Listening Histories Dataset (MLHD): https://ddmal.music.mcgill.ca/research/The_Music_Listening_Histories_Dataset_(MLHD)/
Our first idea is to include most of MTG's datasets to have an easy and centralized way to get any of our datasets. Right now, the way to get and load each one is very heterogeneous, and there are more than 50 MTG datasets.
We are thinking of including the last.fm datasets (http://ocelma.net/MusicRecommendationDataset/index.html) and the MARD dataset (https://www.upf.edu/web/mtg/mard) in mirdata. These datasets are oriented towards recommendation systems, and they don't have audio files.
That's why they don't need a Track class, only a few functions that cover the typical use cases.
Would there be any problem with including them?
Thank you in advance!