Feature engineering research

smadha commented 7 years ago

Additional features:

Cluster IDs for user and question

Can we normalise features?

The features below might help too, but could cause overfitting:

Number of words in each user/question
Number of characters in each user/question

smadha commented 7 years ago

Number of questions answered by a user - [done]
Number of users who answered this question - [done]
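A minimal sketch of the two counts above. The record layout `(user_id, question_id, label)` with label 1 meaning "answered" and the name `answer_counts` are assumptions for illustration, not the project's actual code.

```python
from collections import Counter

# Hypothetical training records: (user_id, question_id, label), label 1 = answered.
records = [
    ("u1", "q1", 1),
    ("u1", "q2", 1),
    ("u2", "q1", 1),
    ("u2", "q2", 0),
]

def answer_counts(records):
    """Return (answers per user, answering users per question) as Counters."""
    per_user = Counter()
    per_question = Counter()
    for user, question, label in records:
        if label == 1:  # only positive (answered) records are counted
            per_user[user] += 1
            per_question[question] += 1
    return per_user, per_question

per_user, per_question = answer_counts(records)
```

A `Counter` returns 0 for users or questions that never appear with label 1, so no special-casing is needed at lookup time.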

kushaank commented 7 years ago

Idea 1: Number of questions answered by a user per category, or number of questions answered by the user with the same question tag
Idea 2: Number of top-quality answers per question

smadha commented 7 years ago

Boolean indicating whether the user has answered this question before

kushaank commented 7 years ago

Similarity of word IDs/char IDs/tags between user and question [DONE]

smadha commented 7 years ago

Cartesian product of user tags with question tags. For example, if all possible user tags are [u1,u2] and all possible question tags are [q1,q2], we create a product feature [u1q1,u1q2,u2q1,u2q2]. Now if a user with tag u1 answers a question with tag q2, the above vector will be [0,1,0,0].
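The product feature can be sketched as follows. The function name `tag_cross_feature` and its arguments are hypothetical; the example reuses the tags from the comment above.

```python
from itertools import product

def tag_cross_feature(user_tags, question_tags, all_user_tags, all_question_tags):
    """Binary vector with one slot per (user tag, question tag) pair."""
    user_tags, question_tags = set(user_tags), set(question_tags)
    return [
        1 if u in user_tags and q in question_tags else 0
        for u, q in product(all_user_tags, all_question_tags)
    ]

# Slots are ordered u1q1, u1q2, u2q1, u2q2.
vec = tag_cross_feature(["u1"], ["q2"], ["u1", "u2"], ["q1", "q2"])
print(vec)  # [0, 1, 0, 0]
```

The vector length is the product of the two tag vocabularies, so it grows quickly when both are large.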

kushaank commented 7 years ago

If the user has answered the question before, give that (user, question) pair a label of 0 and remove the duplicate records with label 1. (Up for discussion.)

smadha commented 7 years ago

[DONE] Feature capturing user history through question tags. For example, suppose there are 5 question categories in total and u1 has answered 1 question. If the answered question is from tag 1, the feature is [1,0,0,0,0]. We compute a similar feature for questions not answered: if the user has not answered 2 questions, say from tags 3 and 5, the feature is [0,0,1,0,1].

Same for question
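A rough sketch of the user-history feature described above. It assumes one tag per question, 0-based tag indices, and binary indicators rather than counts; `tag_history`, `question_tag`, and the record layout are hypothetical names.

```python
def tag_history(user, records, question_tag, num_tags):
    """Return (answered, not_answered) indicator vectors over question tags."""
    answered = [0] * num_tags
    not_answered = [0] * num_tags
    for u, q, label in records:
        if u != user:
            continue
        idx = question_tag[q]  # assumed: each question maps to one tag index
        if label == 1:
            answered[idx] = 1
        else:
            not_answered[idx] = 1
    return answered, not_answered

# Tag 1 is index 0, tag 3 is index 2, tag 5 is index 4.
question_tag = {"qa": 0, "qb": 2, "qc": 4, "qd": 1}
records = [("u1", "qa", 1), ("u1", "qb", 0), ("u1", "qc", 0), ("u2", "qd", 1)]
answered, not_answered = tag_history("u1", records, question_tag, 5)
print(answered)      # [1, 0, 0, 0, 0]
print(not_answered)  # [0, 0, 1, 0, 1]
```

Swapping the roles of users and questions (indexing by user tag for a given question) gives the question-side version.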

smadha commented 7 years ago

For a pair (u_i, q_j), calculate the average similarity score between u_i and the users who have answered question q_j; call it user_sim_answered. Calculate the average similarity score between u_i and the users who have NOT answered question q_j; call it user_sim_not_answered.

Similarly, using questions instead of users, we get two more values: ques_sim_answered and ques_sim_not_answered. [DONE]
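One way to sketch user_sim_answered and user_sim_not_answered. The thread does not say which similarity score is used, so Jaccard similarity over tag sets is an assumption here, as are skipping u_i itself and returning 0.0 when a group is empty.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean(values):
    return sum(values) / len(values) if values else 0.0

def user_sim_features(user, question, records, user_tags):
    """Return (user_sim_answered, user_sim_not_answered) for one (user, question) pair."""
    answered, not_answered = [], []
    for u, q, label in records:
        if q != question or u == user:  # assumption: the user is not compared to itself
            continue
        sim = jaccard(user_tags[user], user_tags[u])
        (answered if label == 1 else not_answered).append(sim)
    return mean(answered), mean(not_answered)

# Hypothetical data: u2 shares all tags with u1, u3 shares none.
user_tags = {"u1": ["a", "b"], "u2": ["a", "b"], "u3": ["c"]}
records = [("u2", "q1", 1), ("u3", "q1", 0)]
sim_answered, sim_not_answered = user_sim_features("u1", "q1", records, user_tags)
```

The question-side values can be computed the same way by comparing q_j with the questions that u_i has and has not answered.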

smadha commented 7 years ago

Cartesian product of user character tags uc_i with question character tags qc_j, keeping the best pairs (uc_i, qc_j) based on how well they distinguish between answered and unanswered questions.

smadha commented 7 years ago

Every feature should have consistent meaning across all training examples.

For example, I created a feature for the history of a question: "tag vector of users who answered question q". For each question this vector had values set at only one or two indices, and those indices changed from question to question. This is likely to confuse the classifier, since examples of the same class (0/1 as of now) get very different values. So I changed the feature. For an example (u, q, label) it is now "number of times a user with tags similar to u answered question q" and "number of times a user with tags different from u answered question q".
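A sketch of the revised feature, which keeps the same meaning for every training example. The thread does not define "similar tags"; sharing at least one tag is an assumption, and `similar_user_answer_counts` and the data layout are hypothetical.

```python
def similar_user_answer_counts(user, question, records, user_tags):
    """Count answers to `question` by users with similar vs. different tags."""
    similar = different = 0
    own_tags = set(user_tags[user])
    for u, q, label in records:
        if q != question or u == user or label != 1:
            continue
        if own_tags & set(user_tags[u]):  # assumption: similar = shares at least one tag
            similar += 1
        else:
            different += 1
    return similar, different

# Hypothetical data: u2 and u4 share a tag with u1, u3 does not.
user_tags = {"u1": ["a", "b"], "u2": ["a", "b"], "u3": ["c"], "u4": ["a"]}
records = [("u2", "q1", 1), ("u3", "q1", 1), ("u4", "q1", 1), ("u4", "q2", 1)]
similar, different = similar_user_answer_counts("u1", "q1", records, user_tags)
```

Both outputs are plain counts, so the same position in the feature vector means the same thing for every (u, q) pair, unlike the per-question tag vector.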

smadha / MlTrio

Feature engineering research #6