tylerchr / parallel-database

An experimental parallelized database optimized for read performance

Convert Million Song Dataset (MSD) to a format with newline-separated tracks, tab-separated fields #2

Closed jaredririe closed 8 years ago

jaredririe commented 8 years ago

http://labrosa.ee.columbia.edu/millionsong/sites/default/files/tutorial1.py.txt

jaredririe commented 8 years ago

Here is the tab-separated file with the following columns: track_id, song_id, title, artist_name, artist_location, artist_hotttnesss, release, year, song_hotttnesss, danceability, duration, loudness, sample_rate, tempo

msd.txt
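
For anyone loading the file, here is a minimal sketch of a reader. It assumes msd.txt has no header row and that the fields appear in exactly the order listed above; neither is confirmed in this thread. `read_tracks` and `COLUMNS` are hypothetical names for illustration.

```python
import csv

# Column order as listed in the comment above (assumed to match the file).
COLUMNS = [
    "track_id", "song_id", "title", "artist_name", "artist_location",
    "artist_hotttnesss", "release", "year", "song_hotttnesss",
    "danceability", "duration", "loudness", "sample_rate", "tempo",
]


def read_tracks(lines):
    """Yield one dict per track from newline-separated, tab-separated lines.

    Values are left as strings. Lines that do not have exactly
    len(COLUMNS) fields (including blank lines) are skipped.
    """
    # QUOTE_NONE keeps any quote characters in titles as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue
        yield dict(zip(COLUMNS, fields))
```

To use it on the real file, pass an open file handle, for example `open("msd.txt", newline="", encoding="utf-8")`; the encoding is also an assumption.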

tylerchr commented 8 years ago

+1

jaredririe commented 8 years ago

It turned out that the "300GB dataset" also had a much leaner 700MB SQLite version. I downloaded it and ran some queries to generate this 100MB text file: https://www.dropbox.com/s/i5g0g7kjzhl6vei/msd.txt?dl=0

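The database gave me access to these columns (note that several columns from the earlier list, such as song_id, artist_location, song_hotttnesss, danceability, loudness, sample_rate, and tempo, are missing here): track_id, title, release, artist_name, duration, artist_familiarity, artist_hotttnesss, year, track_7digitalid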
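
A rough sketch of how an export like this could be done with Python's built-in `sqlite3` module. The table name `songs` is an assumption about the SQLite file's schema, and `export_tracks`, `clean`, and `EXPORT_COLUMNS` are hypothetical names; this is not necessarily the set of queries that produced the file linked above.

```python
import sqlite3

# Columns listed in the comment above, in the same order.
EXPORT_COLUMNS = [
    "track_id", "title", "release", "artist_name", "duration",
    "artist_familiarity", "artist_hotttnesss", "year", "track_7digitalid",
]


def clean(value):
    """Render a value as text that is safe inside a tab-separated line."""
    if value is None:
        return ""
    # Tabs and newlines inside a field would break the format, so
    # replace them with spaces.
    return str(value).replace("\t", " ").replace("\r", " ").replace("\n", " ")


def export_tracks(conn, out, table="songs"):
    """Write one tab-separated line per row to `out`; return the row count.

    `table` defaults to "songs", which is an assumed table name.
    """
    query = "SELECT {} FROM {}".format(", ".join(EXPORT_COLUMNS), table)
    count = 0
    for row in conn.execute(query):
        out.write("\t".join(clean(v) for v in row) + "\n")
        count += 1
    return count
```

Here `conn` would come from `sqlite3.connect(...)` pointed at the downloaded database, and `out` from an output file opened for writing.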

The final column, track_7digitalid, is an integer that could potentially be used to access a short sample of the song. Imagine if we took advantage of this during our demo: not only could we display the highest-rated song according to artist hotttnesss, we could also play a short bit of it.
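
As a sketch of the lookup side of that demo idea, the function below scans the tab-separated lines and returns the track with the highest artist_hotttnesss, including its track_7digitalid. It assumes the text file uses the nine-column order listed above with no header row. `hottest_track` and `DB_COLUMNS` are hypothetical names, and fetching or playing the sample itself is not covered here.

```python
import math

# Assumed column order of the text file generated from the SQLite database.
DB_COLUMNS = [
    "track_id", "title", "release", "artist_name", "duration",
    "artist_familiarity", "artist_hotttnesss", "year", "track_7digitalid",
]


def hottest_track(lines):
    """Return the track dict with the highest artist_hotttnesss, or None.

    Lines with the wrong number of fields, or with a missing, non-numeric,
    or NaN hotttnesss value, are skipped. Ties keep the first track seen.
    """
    best = None
    best_score = -math.inf
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(DB_COLUMNS):
            continue
        track = dict(zip(DB_COLUMNS, fields))
        try:
            score = float(track["artist_hotttnesss"])
        except ValueError:
            continue
        if math.isnan(score):
            continue
        if score > best_score:
            best, best_score = track, score
    return best
```

A single linear scan like this is only a baseline; the parallel database would presumably answer the same query much faster.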