Final Peer Review - mrl233

The goal of this project is to classify the genre of a song given musical data about the song. The team uses a very big, messy dataset consisting of song data and another dataset consisting of genre labels. The team uses many models and techniques from class to approach this problem.

Strengths:

The report itself is well formatted and well written.
I like that this is a practical problem to solve. Spotify is a service that I use daily, and their genre classification ability is a key value proposition to me.
I like that the team explicitly considers class imbalance as an issue. This is important for avoiding a model that is biased toward the most frequent genres.

Improvements:

Though they started with many songs in their dataset, the team subsampled in order to achieve class balance. It would be interesting to instead consider all the data available to them, and simply tune the class weights to effectively balance the classes.
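The team tries a very wide variety of models and reports the test set accuracies for all of them. This limits the generalizability of their results (à la the union bound and the Hoeffding bound): the more models are compared on the same test set, the more optimistic the best reported test accuracy is likely to be. It would be better to test just a promising few.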
The team could use more summary statistics from the array data (section 4.2) such as various quantiles.
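
The class-weighting suggestion above could look roughly like the sketch below. It is only an illustration: the function name and the toy genre labels are made up, and the weighting rule (total samples divided by the number of classes times the class count) is one common heuristic, the same one scikit-learn uses for `class_weight="balanced"`, not necessarily what would work best for this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1 and frequent classes get weights
    below 1, so every class contributes the same total weight.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {genre: n_samples / (n_classes * count) for genre, count in counts.items()}


# Hypothetical, made-up genre labels (not taken from the project's data).
labels = ["rock"] * 6 + ["jazz"] * 3 + ["folk"] * 1
weights = balanced_class_weights(labels)

for genre, weight in sorted(weights.items()):
    print(f"{genre}: {weight:.3f}")
```

The resulting dictionary can be passed to any model that accepts per-class weights in its loss, which keeps all of the data in play instead of discarding songs from the larger genres.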

@mluop Thank you for your feedback! There are various standard methods for dealing with class imbalance, subsampling being one of them. We didn't feel it was worthwhile to train our model on the true set of data (with the class imbalance). Judging from the low frequency of most genres, which sit around 5%, the model would almost surely just ignore these genres as possible predictions. We did not verify this empirically, but judging from our baseline models and initial data analysis, we had strong reason to believe it. In the end, and I think this is mentioned in the report, we evaluated the model trained on subsampled data and found that it performed just as well on the representative dataset, so we're relatively confident that the model we trained on the subsampled data generalizes.

Regarding the generalizability of our results, I'm not sure what the Hoeffding bound has to do with our problem. The Hoeffding bound is a function of a fixed hypothesis, a number n denoting the dataset size, and an epsilon. It is evaluated on a single hypothesis, so on its own it has no relation to our use of a wide variety of models. I'd appreciate it if you could clarify what you mean :)

I do agree with your statement regarding the union bound. The larger the set of hypotheses we have, the weaker the upper bound we can place on the probability that the difference between our in-sample error and out-of-sample error exceeds some epsilon. However, applying the union bound in this case is not very insightful, because even if we did limit our results to those of one model (say a classification tree), there would still be a large number of classification tree models, each with its own parameters. The union bound badly overcounts and the resulting upper bound is meaningless.
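
For reference, here is a rough sketch of how the two bounds combine when a finite number of already-trained models is scored on the same test set. Hoeffding gives P(|test error - true error| > epsilon) <= 2 exp(-2 n epsilon^2) for one fixed model, and the union bound multiplies the right-hand side by the number of models M. The function name, the test-set size, and the model counts below are illustrative assumptions, not numbers from the report.

```python
import math


def union_hoeffding_epsilon(n_test, n_models, delta=0.05):
    """Deviation epsilon such that, with probability at least 1 - delta,
    every one of n_models fixed models has |test error - true error| <= epsilon.

    Obtained by solving 2 * n_models * exp(-2 * n_test * epsilon**2) = delta.
    """
    return math.sqrt(math.log(2 * n_models / delta) / (2 * n_test))


# Hypothetical test-set size, chosen only for illustration.
n_test = 1000
eps_one_model = union_hoeffding_epsilon(n_test, n_models=1)
eps_ten_models = union_hoeffding_epsilon(n_test, n_models=10)

print(f"1 model:   epsilon = {eps_one_model:.4f}")
print(f"10 models: epsilon = {eps_ten_models:.4f}")
```

Under these assumptions the guarantee loosens only logarithmically in the number of models scored on the test set. This counts models evaluated on held-out data, not the size of the hypothesis class searched during training, which is where the overcounting concern applies.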
