Data Collection - Githubissues

We would probably get better data from APIs, but they may come with some limitations, right? We will also need a large dataset, since we have many song mood descriptor classes. To prevent bias, we likely need a large number of artist-song pairs (or artist-song-playlist triples) to increase variability with respect to genre (and probably some other predictors; we should analyze that further).
I have finished cleaning the small-scale data, and it now contains around 500k unique artist-song pairs. One question: what proportion of newer versus older data should we use? The released_year of a song can probably determine patterns in the data. In fact, release years define global trends.
Given these considerations, maybe we should collect more data up front to prevent target class imbalance, or bias in the data more generally.
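To make the newer-versus-older and class-imbalance questions concrete, here is a minimal sketch of how we could measure both on the cleaned data. It assumes each record is a dict; released_year comes from our data, while the field names artist, song and mood, and the helper names, are hypothetical placeholders. The rows in the usage note are made up for illustration only.

```python
from collections import Counter


def normalize(text):
    # Lowercase and collapse whitespace so trivially different spellings match.
    return " ".join(text.lower().split())


def unique_pairs(records):
    """Keep one record per normalized (artist, song) pair; the first one wins."""
    seen = {}
    for rec in records:
        key = (normalize(rec["artist"]), normalize(rec["song"]))
        seen.setdefault(key, rec)
    return list(seen.values())


def decade_shares(records):
    """Fraction of records per release decade, derived from released_year."""
    counts = Counter((rec["released_year"] // 10) * 10 for rec in records)
    total = sum(counts.values())
    return {decade: n / total for decade, n in sorted(counts.items())}


def mood_shares(records):
    """Fraction of records per mood class (hypothetical 'mood' field)."""
    counts = Counter(rec["mood"] for rec in records)
    total = sum(counts.values())
    return {mood: n / total for mood, n in counts.most_common()}
```

Usage on toy rows would look like `pairs = unique_pairs(rows)` followed by `decade_shares(pairs)` and `mood_shares(pairs)`. If one decade or one mood class dominates the shares, that is a sign we should collect more data for the underrepresented groups before training.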
There are some large SQL dumps and tar.gz files available online. I am thinking about processing these. What do you think?
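For the tar.gz files, one option is to stream them instead of unpacking everything to disk first. The sketch below uses only Python's standard tarfile module; the function name iter_tar_lines and the assumption that the archives contain UTF-8 .csv files are mine, since I have not inspected the actual dumps yet.

```python
import io
import tarfile


def iter_tar_lines(path, suffix=".csv"):
    """Yield (member_name, line) for each text line in matching archive members.

    Members are read one at a time, so the archive is never fully extracted.
    The suffix filter and UTF-8 encoding are assumptions about the dump layout.
    """
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links, and files we are not interested in.
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Decode lazily; undecodable bytes are replaced rather than raising.
            for line in io.TextIOWrapper(handle, encoding="utf-8", errors="replace"):
                yield member.name, line.rstrip("\n")
```

Each yielded line could then go through the same cleaning steps as the small-scale data. The SQL dumps would need a different reader, which this sketch does not cover.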

mf-caglar / song_analysis_project