Open smadha opened 7 years ago
Number of questions answered by a user - [done]
Number of users who answered this question - [done]
Idea 1: Number of questions answered by a user per category / number of questions answered by the user with the same question tag
Idea 2: Number of top quality answers per question
Boolean indicating whether the user has answered this question before
Cartesian product of user tags with question tags. For example, if all possible user tags are [u1,u2] and all possible question tags are [q1,q2], we create a product feature [u1q1,u1q2,u2q1,u2q2]. If a user with tag u1 answers a question with tag q2, the vector will be [0,1,0,0].
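A minimal sketch of the product feature described above, reproducing the [u1,u2] x [q1,q2] example. The function name `cross_tag_features` and its arguments are hypothetical, not names from the project code.

```python
from itertools import product


def cross_tag_features(user_tags, question_tags, all_user_tags, all_question_tags):
    """Return a 0/1 vector over every (user tag, question tag) pair.

    Hypothetical helper: the slot order follows
    itertools.product(all_user_tags, all_question_tags).
    """
    all_pairs = list(product(all_user_tags, all_question_tags))
    active_pairs = set(product(user_tags, question_tags))
    return [1 if pair in active_pairs else 0 for pair in all_pairs]


# User with tag u1 answers a question with tag q2.
vec = cross_tag_features(["u1"], ["q2"], ["u1", "u2"], ["q1", "q2"])
print(vec)  # [0, 1, 0, 0]
```

Note that the vector length is (number of user tags) x (number of question tags), so it grows quickly with large tag vocabularies.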
If the user has answered the question before, give that (user, question) pair a label of 0 and remove the duplicate records with label 1. (up for discussion)
[DONE] Feature for capturing user history using question tags. For example, u1 has answered 1 question, and there are 5 question categories in total. If the answered question is from tag1, his feature will look like [1,0,0,0,0]. Similarly, we calculate a feature for questions not answered: if the user has not answered 2 questions, say from tags 3 and 5, the feature looks like [0,0,1,0,1].
Same for question
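A rough sketch of the two history vectors from the example above. It assumes binary indicators (a count per tag would be a small change) and a hypothetical input format of (0-based tag index, answered flag) pairs; neither is confirmed by the project code.

```python
def tag_history_vectors(history, num_tags):
    """Build answered / not-answered indicator vectors over question tags.

    history: iterable of (tag_index, was_answered) pairs, tag_index 0-based.
    This input format is an assumption made for illustration.
    """
    answered = [0] * num_tags
    not_answered = [0] * num_tags
    for tag_index, was_answered in history:
        if was_answered:
            answered[tag_index] = 1
        else:
            not_answered[tag_index] = 1
    return answered, not_answered


# u1 answered one question from tag 1 and did not answer questions
# from tags 3 and 5 (0-based indices 0, 2 and 4).
history_u1 = [(0, True), (2, False), (4, False)]
answered_vec, not_answered_vec = tag_history_vectors(history_u1, num_tags=5)
print(answered_vec)      # [1, 0, 0, 0, 0]
print(not_answered_vec)  # [0, 0, 1, 0, 1]
```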
For a pair u_i, q_j, calculate the average similarity score between u_i and the users who have answered question q_j; call it user_sim_answered. Calculate the average similarity score between u_i and the users who have NOT answered question q_j; call it user_sim_not_answered.
Similarly, we can get two values by using questions: ques_sim_answered and ques_sim_not_answered. [DONE]
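One way the two user-side values could be computed is sketched below. The similarity measure is not specified in this thread, so Jaccard similarity on tag sets is used purely as a stand-in; excluding u_i from both groups is also an assumption, and all function and variable names are hypothetical.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity of two tag collections (stand-in similarity measure)."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(target_tags, other_tag_sets):
    """Average similarity of target_tags to each tag set; 0.0 if there are none."""
    if not other_tag_sets:
        return 0.0
    return sum(jaccard(target_tags, t) for t in other_tag_sets) / len(other_tag_sets)


def user_similarity_features(user, question, user_tags, answered_by):
    """Return (user_sim_answered, user_sim_not_answered) for a (user, question) pair.

    user_tags: dict mapping user id -> set of tags.
    answered_by: dict mapping question id -> set of user ids who answered it.
    The user itself is left out of both groups (an assumption).
    """
    answerers = answered_by.get(question, set()) - {user}
    non_answerers = set(user_tags) - answerers - {user}
    sim_answered = mean_similarity(user_tags[user], [user_tags[u] for u in answerers])
    sim_not_answered = mean_similarity(user_tags[user], [user_tags[u] for u in non_answerers])
    return sim_answered, sim_not_answered


user_tags = {"u1": {"a", "b"}, "u2": {"a", "b"}, "u3": {"c"}}
answered_by = {"q1": {"u2"}}
features = user_similarity_features("u1", "q1", user_tags, answered_by)
print(features)  # (1.0, 0.0)
```

The question-side values (ques_sim_answered, ques_sim_not_answered) would follow the same pattern with the roles of users and questions swapped.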
Cartesian product of user character tags (uc_i) with question character tags (qc_j), choosing the best pairs (uc_i, qc_j) on the basis of how well they distinguish between answered and unanswered questions.
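The selection criterion is not pinned down here, so the sketch below uses one simple possibility: score each (uc_i, qc_j) pair by the absolute difference between how often it occurs in answered examples and how often it occurs in unanswered ones. This scoring rule, the input format, and the name `rank_tag_pairs` are all assumptions for illustration.

```python
from collections import Counter
from itertools import product


def rank_tag_pairs(examples):
    """Rank (user char tag, question char tag) pairs by class separation.

    examples: iterable of (user_char_tags, question_char_tags, label),
    with label 1 for answered and 0 for unanswered.
    Score = |rate among answered - rate among unanswered| (an assumed criterion).
    """
    pair_counts = {0: Counter(), 1: Counter()}
    totals = Counter()
    for user_tags, question_tags, label in examples:
        totals[label] += 1
        # Count each pair at most once per example.
        pair_counts[label].update(set(product(user_tags, question_tags)))

    scores = {}
    for pair in set(pair_counts[0]) | set(pair_counts[1]):
        rate_answered = pair_counts[1][pair] / totals[1] if totals[1] else 0.0
        rate_unanswered = pair_counts[0][pair] / totals[0] if totals[0] else 0.0
        scores[pair] = abs(rate_answered - rate_unanswered)
    # Highest score first; ties broken by pair name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


examples = [
    (["uc1"], ["qc1"], 1),
    (["uc1"], ["qc1"], 1),
    (["uc1"], ["qc2"], 0),
    (["uc2"], ["qc1"], 0),
    (["uc2"], ["qc1"], 1),
]
ranked = rank_tag_pairs(examples)
best_pairs = [pair for pair, _ in ranked[:2]]
print(best_pairs)  # [('uc1', 'qc1'), ('uc1', 'qc2')]
```

Only the top-ranked pairs would then be kept as features, which keeps the product vector much smaller than the full Cartesian product.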
Every feature should have consistent meaning across all training examples.
For example, I created a feature for the history of a question: "tag vector of users who answered a question q". For each question, this vector had values set at only one or two indices, and those indices changed from question to question. This should confuse the classifier, since for the same class (as of now 0/1) we have very different values. So I changed the feature: for an example (u,q,label), it is now "number of times users with similar tags to u answered question q" and "number of times users with different tags from u answered question q".
Additional features-
Can we normalise features?
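As one possible answer, count-style features such as "number of questions answered by a user" could be rescaled to [0, 1] with min-max scaling. This is only a sketch of one common option (standardisation to zero mean and unit variance is another), and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1].

    A constant column maps to all zeros to avoid dividing by zero.
    """
    lowest, highest = min(column), max(column)
    if highest == lowest:
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


answer_counts = [0, 5, 10]
scaled = min_max_scale(answer_counts)
print(scaled)  # [0.0, 0.5, 1.0]
```

The minimum and maximum should be taken from the training data only and reused for validation and test examples, so that every feature keeps a consistent meaning across examples.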
The below might help too, but may overfit -