Last.fm dataset and MARD dataset

PRamoneda commented 4 years ago

Our first idea is to include most of MTG's datasets to have an easy and centralized way to get any of our datasets. Right now, the way to get and load each one is very heterogeneous, and there are more than 50 MTG datasets.

We are thinking of including in mirdata the last.fm datasets http://ocelma.net/MusicRecommendationDataset/index.html and the MARD dataset https://www.upf.edu/web/mtg/mard. These datasets are heavily oriented towards recommendation systems, and they don't have audio files.

That's why they don't need a track class—only several functions to satisfy typical use cases.

Would there be any problem with including them?

Thank you in advance!

rabitt commented 4 years ago

This is similar to the case of the million song dataset, which also doesn't have audio. #81

I think it could still have a track class without audio, assuming there are track-specific annotations. However, if the dataset is e.g. a relational database, we could have a loader without a Track class, and think about a different type of structure for that.
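A minimal sketch of what a track object without audio could look like. Everything here is hypothetical and for illustration only: `AnnotationOnlyTrack`, its fields, and `has_audio` are assumed names, not part of mirdata's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AnnotationOnlyTrack:
    """Hypothetical track that carries annotations but may have no audio."""

    track_id: str
    # Track-specific annotations, e.g. tags or precomputed feature paths.
    annotations: Dict[str, str] = field(default_factory=dict)
    # None when the dataset does not distribute audio.
    audio_path: Optional[str] = None

    @property
    def has_audio(self) -> bool:
        """True only when an audio path was provided."""
        return self.audio_path is not None


# Illustrative values only, not taken from any real dataset.
track = AnnotationOnlyTrack("TR0001", {"genre": "rock"})
```

Code that needs audio can check `track.has_audio` first, while annotation-only use cases work unchanged.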

Ping @andreasjansson who was thinking about this when we first designed this repository!

nkundiushuti commented 4 years ago

@rabitt, @PRamoneda, @andreasjansson there are 2 issues here:

1. datasets which distribute solely annotations; audio for the tracks is not available due to licensing or other issues
2. datasets with annotations which do not revolve around the Track paradigm; they are basically databases, tables with an N-to-N relationship

first issue: millionsongdataset and maybe also acousticbrainz and other datasets for which we distribute the features and not the audio. this can probably be solved with minimal changes in mirdata

second issue: a bit more complex. in lastfm1k we can still use the Track paradigm, because we have songs which have been listened to by N users at N times. however, in lastfm360k and other datasets (https://www.upf.edu/web/mtg/semantic-similarity) we can't use the Track object anymore and we need to think about developing other objects. the question is whether we should develop this in parallel to mirdata or whether we can integrate it: maybe we can create two categories, mirdata and mirdata-tuples, and put all these cases there. for the tests it will be a bit of a mess...

I suggested the name mirdata-tuples because all these non-audio datasets seem to return tuples:

user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays
user-mboxsha1 \t gender (m|f|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)

then maybe we need to provide methods to filter a particular artist or user from the tuples (which is what is needed in music recommendation).
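A rough sketch of how the tuple idea could look in Python, assuming tab-separated rows in the user/artist/plays layout mentioned above. The names `PlayCount`, `parse_play_counts` and `filter_tuples` are hypothetical, the sample rows are made up, and none of this is existing mirdata code.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class PlayCount(NamedTuple):
    """One row: user-mboxsha1, musicbrainz-artist-id, artist-name, plays."""

    user_id: str
    artist_mbid: str
    artist_name: str
    plays: int


def parse_play_counts(lines: Iterable[str]) -> Iterator[PlayCount]:
    """Yield one PlayCount per tab-separated line with exactly four fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows rather than guessing
        user_id, artist_mbid, artist_name, plays = fields
        yield PlayCount(user_id, artist_mbid, artist_name, int(plays))


def filter_tuples(
    rows: Iterable[PlayCount],
    user_id: Optional[str] = None,
    artist_name: Optional[str] = None,
) -> List[PlayCount]:
    """Keep rows matching the given user and/or artist (None means 'any')."""
    return [
        row
        for row in rows
        if (user_id is None or row.user_id == user_id)
        and (artist_name is None or row.artist_name == artist_name)
    ]


# Made-up sample rows, for illustration only.
sample = [
    "u1\tmbid-a\tArtist A\t12\n",
    "u1\tmbid-b\tArtist B\t3\n",
    "u2\tmbid-a\tArtist A\t7\n",
]
rows = list(parse_play_counts(sample))
rows_for_u1 = filter_tuples(rows, user_id="u1")
rows_for_artist_a = filter_tuples(rows, artist_name="Artist A")
```

A real loader would probably stream the file instead of building lists, but the filter-by-user and filter-by-artist shape would stay the same.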

an example of lastfm data loader here: https://github.com/benfred/implicit/blob/master/examples/lastfm.py

magdalenafuentes commented 4 years ago

Hey @nkundiushuti!

I also like the idea of including the recommendation datasets, and I'm open to including an alternative object in the loaders, so they have either a track or a tuple. Though I'm not familiar enough with these datasets to know if a tuple object would be enough to cover most of them.

I quickly asked @andrebola, who works in recommendation, and he mentioned this package, which seems relevant. Maybe we can take inspiration from it?

andrebola commented 4 years ago

Hi all! It's probably very difficult to support all the possible alternatives and corner cases. I think it is a good idea to follow some library for recommendations.

I just wanted to add to what Marius commented that:

- some datasets can contain temporal information about when a user interacted with an item, which is important, for example, for splitting the data or for sequential recommendation
- other datasets provide the session in which a user listened to an item; in that case, you can have something like <user-id, session-id, artist, track, timestamp>

I think lastfm-1k provides the timestamps, so it can't be simplified to <user, track, N times>. These are some datasets that can be used for reference: LFM-1b, #nowplaying-rs, #nowplaying and 30music.
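To illustrate why the timestamps matter for splitting, here is a small sketch of a chronological train/test split. `ListenEvent`, `temporal_split` and the sample events are hypothetical and only loosely follow the <user, track, timestamp> shape discussed above; the real datasets have their own field names and formats.

```python
from typing import List, NamedTuple, Sequence, Tuple


class ListenEvent(NamedTuple):
    """Hypothetical listening event with an integer timestamp."""

    user_id: str
    track: str
    timestamp: int  # assumed to be seconds since the epoch


def temporal_split(
    events: Sequence[ListenEvent], train_fraction: float = 0.8
) -> Tuple[List[ListenEvent], List[ListenEvent]]:
    """Sort by time, then put the earliest fraction in train and the rest in test."""
    ordered = sorted(events, key=lambda event: event.timestamp)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up events, for illustration only.
events = [
    ListenEvent("u1", "t1", 300),
    ListenEvent("u2", "t2", 100),
    ListenEvent("u1", "t3", 200),
    ListenEvent("u2", "t1", 400),
    ListenEvent("u1", "t2", 500),
]
train, test = temporal_split(events, train_fraction=0.8)
```

Collapsing the events to <user, track, N times> would throw away the ordering that this kind of split depends on.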

rabitt commented 4 years ago

I think this starts to hit on some new core issues we haven't yet faced in mirdata! I've just opened #277 and #276 to start thinking about how to cleanly handle these cases

nkundiushuti commented 4 years ago

a similar dataset that fits this scenario is: https://ddmal.music.mcgill.ca/research/The_Music_Listening_Histories_Dataset_(MLHD)/

mir-dataset-loaders / mirdata

Last.fm dataset and MARD dataset #272