jimmychen623 / genre_classification

ORIE4741 Project : Classifying genre of songs

Final Peer Review - mrl233 #15

Open mluop opened 6 years ago

mluop commented 6 years ago

The goal of this project is to classify the genre of a song given musical data about the song. The team uses a very large, messy dataset of song data together with a second dataset of genre labels, and approaches the problem with many models and techniques from class.

Strengths:

Improvements:

jimmychen623 commented 6 years ago

@mluop Thank you for your feedback! There are various standard methods for dealing with class imbalance, subsampling being one of them. We didn't feel it was worthwhile to train our model on the true dataset (with the class imbalance). Judging from the low frequency of most genres, which sit around 5% each, the model would almost surely just ignore these genres as possible predictions. We did not verify this empirically, but our baseline models and initial data analysis gave us strong reason to believe it. In the end, and I think this is mentioned in the report, we evaluated the model trained on subsampled data and found that it performed just as well on the representative dataset, so we're relatively confident that the model trained on the subsampled data generalizes.
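To make the subsampling idea concrete, here is a minimal sketch of one common variant: randomly downsampling every class to the size of the rarest one. The thread does not say which scheme the project actually used, so the function name `subsample_balanced`, the seed, and the toy label counts are all assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def subsample_balanced(labels, seed=0):
    """Return sorted indices in which every class appears equally often.

    Hypothetical helper: each class is randomly downsampled (without
    replacement) to the size of the rarest class.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # The rarest class determines how many examples each class keeps.
    target = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    return sorted(kept)


# Toy, made-up label distribution (not the project's data).
labels = ["rock"] * 60 + ["jazz"] * 5 + ["folk"] * 5
kept = subsample_balanced(labels)
balanced_counts = Counter(labels[i] for i in kept)
```

A model trained on the `kept` indices can then be evaluated on a held-out set that keeps the original class frequencies, which is the kind of check described above.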

Regarding the generalizability of our results, I'm not sure what the Hoeffding bound has to do with our problem. The Hoeffding bound applies to a single fixed hypothesis and depends on the dataset size n and a tolerance epsilon. Because it is evaluated on a single hypothesis, it has no direct relation to our using a wide variety of models. I'd appreciate it if you could clarify what you mean :)
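For reference, this is the textbook form of the bound I have in mind. The notation is an assumption and does not come from the report: $h$ is a hypothesis fixed before seeing the data, $E_{\text{in}}(h)$ and $E_{\text{out}}(h)$ are its in-sample and out-of-sample errors, and the $n$ samples are assumed i.i.d.

$$
\mathbb{P}\left[\,\lvert E_{\text{in}}(h) - E_{\text{out}}(h)\rvert > \epsilon\,\right] \le 2e^{-2\epsilon^{2} n}
$$

The right-hand side depends only on $n$ and $\epsilon$, and the statement holds for one $h$ at a time.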

I do agree with your statement regarding the union bound. The larger the set of hypotheses we consider, the looser our upper bound on the probability that the in-sample error and out-of-sample error differ by more than some epsilon. However, applying the union bound in this case is not very insightful: even if we limited our results to one model class (say, classification trees), there would still be a large number of classification tree models, each with its own parameters. The union bound badly overcounts, and the resulting upper bound is meaningless.
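A rough numerical sketch of why the bound becomes meaningless, assuming the standard finite-hypothesis form 2·M·exp(−2·ε²·n) with M hypotheses. The function name and the values of n, ε, and M are made up for illustration and are not taken from the project.

```python
import math


def union_hoeffding_bound(n, eps, num_hypotheses=1):
    """Union bound over `num_hypotheses` fixed hypotheses.

    Upper-bounds the probability that any of them has in-sample and
    out-of-sample errors differing by more than `eps`. The value is not
    clipped, so a result >= 1 means the bound says nothing at all.
    """
    return 2 * num_hypotheses * math.exp(-2 * eps ** 2 * n)


# Illustrative numbers only.
single = union_hoeffding_bound(n=1000, eps=0.05)                     # one hypothesis
many = union_hoeffding_bound(n=1000, eps=0.05, num_hypotheses=1000)  # 1000 hypotheses
```

With these numbers, `single` is a small probability while `many` exceeds 1, so the bound is already vacuous at a thousand hypotheses, far fewer than the number of possible classification trees.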