liaoandi / MACS30200proj

Directory for MACS 30200 Computational Research

Review - Kanyao #3

Open khan1792 opened 6 years ago

khan1792 commented 6 years ago

A well-designed project! The data and method sections are impressive: your brief sentences address many of the concerns that arise when you manipulate your data and build the model. They make the results more convincing.

I also have some suggestions and questions.

  1. The first question is about the response and the predictors. Your response (h-index) is calculated from the number of answers and the number of agrees, and your predictors also contain them. Why use a machine learning model when you can calculate it directly? It would be more reasonable to use only topic, article, following and follower to predict whether a user is an expert. If you do so, predictors such as thanked and favorite should not be used either, because once you know the numbers of favorites and thanks, you effectively know the numbers of answers and agrees.
  2. You may move the data section to the bottom left and put the summary statistics in the middle of the poster.
  3. I'm confused by the y-axis of Chart 1. What do its values mean? The numbers of FN, FP and correct predictions? Besides, I think a confusion matrix that shows precision and recall as percentages is a much better choice for visualization than a plot of FN, FP and correct predictions.
  4. The demographic information contained in the results section can be moved to the summary statistics part.
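
To make the concern in item 1 concrete: if the h-index follows the usual definition (an assumption here, since the exact formula used on the poster is not shown in this thread), it is a deterministic function of per-answer agree counts. A minimal sketch, where `h_index` and the sample counts are hypothetical:

```python
def h_index(agree_counts):
    """Largest h such that at least h answers have >= h agrees each.

    Assumes the standard h-index definition applied to answers and agrees;
    the project's own formula may differ.
    """
    ranked = sorted(agree_counts, reverse=True)
    h = 0
    for rank, agrees in enumerate(ranked, start=1):
        if agrees >= rank:
            h = rank  # the top `rank` answers all have at least `rank` agrees
        else:
            break
    return h


# Hypothetical user with five answers and their agree counts.
example_user = [10, 8, 5, 4, 3]
example_h = h_index(example_user)
```

Under this assumption, any feature that encodes answer and agree counts gives the model most of what it needs to reconstruct the label, which is the leakage risk raised above.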

liaoandi commented 6 years ago

Thanks for your review! The first question will be fully explained in my paper. In short, the h-index is not available for all user records in the dataset: only 1,000 out of 80,000 have an h-index. Also, writing answers and receiving agrees should be viewed as the result of complicated social interaction. It might not be the ideal way to build the classifier, but it might be the best I can do given the time limit. I should also reconsider the placement of the graphs and the way I present my results. I chose the numbers of FN, FP and correct predictions because the dataset is extremely unbalanced. If I used percentages, all of the metrics would be less than 1%.
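
As a rough illustration of the two ways of reporting results discussed in this thread, the sketch below computes raw confusion counts, their share of all records, and precision and recall. The labels are made up for illustration (1,000 users, 10 of them experts) and are not the project's data; `confusion_counts` and `precision_recall` are hypothetical helper names.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks an expert."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class; 0.0 if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up, highly unbalanced example: 10 experts among 1,000 users.
y_true = [1] * 10 + [0] * 990
y_pred = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 987

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)

# Share of all records that are true positives (normalized by dataset size).
tp_share = tp / len(y_true)
```

In this made-up case the true positives are well under 1% of all records, while precision and recall are normalized by the predicted and actual positives rather than by the whole dataset, so they stay on a readable scale.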