gloriadazevedo / ORIE4741_Project


Final Peer Review (an533) #12

Open AnthonyNiznik opened 7 years ago

AnthonyNiznik commented 7 years ago

Summary

The purpose of the project is to improve the surveys used in online/speed dating events in order to increase the accuracy of predicting matches between people. The team tries to accomplish this by identifying which characteristics are important in determining which couples match. The team models what attributes men look for as well as what women look for in a person they date. The data that is explored consists of post-event surveys from Kaggle.

Positives

  1. The executive summary presents the problem and findings very clearly. It also explains the limitations of the model as well as the reasoning behind choosing the final model.

  2. I like how you addressed concerns the reader may have about the imputation of the data.

  3. The last few sentences in your conclusion section are super solid. Summarizing your findings in this way is very effective.

  4. The histogram of attractiveness scores is simple, yet effective. It shows that the "No" response is approximately normal while the "Yes" response is left-skewed.

Improvements

  1. The paper states: "In this analysis (logistic) all the values with missing data have been removed from the data when computing the coefficients for the logistic regression, as it would be misleading to impute the values of scores given to partners or assign the average value when the answer has been left blank by the participant." Your logistic model retains about 83% of the data, but the discarded 17% may still carry a lot of information. For missing values in the feature space, perhaps you could try K-nearest neighbors to impute the values that appear to have been left blank at random. Of course, you should check the data to see whether this imputation makes sense (i.e., whether people with the same classification tend to have similar values for this particular feature). I do not think imputation is misleading if you provide legitimate reasoning for your assumptions.

  2. In the conclusion, you state "the same for both training and making predictions so there’s a high probability that the model is over-fitting the data." Are you referring to the logistic model and the fact that it was trained and tested on the same data? This should be made clearer.

  3. As an aside, there is one grammar error in section 4: "try" and "tried" are used in the same paragraph, and the tense should be consistent.
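To make point 2 concrete, here is a rough sketch of a hold-out split, so that the model is evaluated on rows it never saw during fitting. This is not the team's code; the function name `train_test_split`, the 20% test fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly partition rows into disjoint training and test lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])   # the first n_test shuffled indices form the test set
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))                 # stand-in for survey rows
train, test = train_test_split(rows)
```

Fitting the logistic model on `train` and reporting accuracy on `test` would give a less optimistic estimate than scoring on the same rows used for fitting.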

claraong commented 7 years ago

This paper [http://www.litech.org/~wkiri/Papers/wagstaff-missing-ifcs04.pdf] says, "In general, clustering methods cannot analyze items that have missing data values." The team also said this in the report.

AnthonyNiznik commented 7 years ago

Hi Clara,

What I am suggesting in the first point of the improvements section is to use the K-nearest neighbors algorithm to impute a feature in the X matrix (assuming that every observation at least has its classification). Please refer to the paper at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.3558&rep=rep1&type=pdf. I am not suggesting a clustering method. Perhaps you were thinking of K-means clustering. Please see http://stats.stackexchange.com/questions/56500/what-are-the-main-differences-between-k-means-and-k-nearest-neighbours for the difference between K-means clustering and K-nearest neighbors.
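For reference, a minimal sketch of the kind of K-nearest-neighbors imputation I have in mind. It is only an illustration under my own assumptions, not the method from the linked paper or the team's code: the function name `knn_impute`, the use of `None` for a blank answer, and the distance (root mean squared difference over the features both rows answered) are all hypothetical choices.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows."""

    def distance(a, b):
        # Compare only the features that both rows actually answered.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have a value for feature j.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the blank in place if no usable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


survey = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],   # third answer left blank
    [0.9, 1.9, 2.8],
    [9.0, 9.0, 9.0],    # very different respondent, should not be used as a donor
]
filled = knn_impute(survey, k=2)
```

Note that no clustering happens here: each blank is filled from the individual rows closest to it. Restricting the donors to rows with the same classification would be one way to use the labels. If you use scikit-learn, `sklearn.impute.KNNImputer` provides a ready-made version of this idea.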

claraong commented 7 years ago

Oops! Realized my error. Thanks! [http://stats.stackexchange.com/questions/200273/k-nearest-neighbour-imputation-of-missing-values]