smadha / MlTrio

CSCI-567 course project
Apache License 2.0

Data analysis #7

Open smadha opened 7 years ago

smadha commented 7 years ago

245,752 LABELED SAMPLES
8,095 questions
28,763 users

27,324 questions answered
218,428 questions not answered

6,182 users answered at least one question
690 users answered more than 10 questions
23 users answered more than 50 questions

5,877 questions answered at least once
705 questions answered more than 10 times
28 questions answered more than 30 times

30,467 TEST SAMPLES
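For anyone who wants to reproduce counts of this kind, here is a minimal sketch. It assumes the labeled data can be read as (question_id, user_id, label) triples with label 1 meaning "answered"; the toy rows, the `summarize` helper, and the `threshold` parameter are hypothetical and not taken from the project code.

```python
from collections import Counter

# Hypothetical toy rows: (question_id, user_id, label); label 1 is assumed to mean "answered".
rows = [
    ("q1", "u1", 1),
    ("q1", "u2", 0),
    ("q2", "u1", 1),
    ("q2", "u3", 0),
    ("q3", "u1", 0),
]

def summarize(rows, threshold=1):
    """Count answered samples and how many users/questions exceed a threshold."""
    answered = [(q, u) for q, u, label in rows if label == 1]
    per_user = Counter(u for _, u in answered)          # answers given per user
    per_question = Counter(q for q, _ in answered)      # answers received per question
    return {
        "samples": len(rows),
        "answered": len(answered),
        "not_answered": len(rows) - len(answered),
        "users_with_answer": len(per_user),
        "questions_with_answer": len(per_question),
        "users_over_threshold": sum(1 for c in per_user.values() if c > threshold),
        "questions_over_threshold": sum(1 for c in per_question.values() if c > threshold),
    }

stats = summarize(rows)
print(stats)
```

Calling `summarize(rows, threshold=10)` or `threshold=50` on the real data would give the "more than 10" and "more than 50" style counts, provided the assumed row layout matches the actual file.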

kushaank commented 7 years ago

Probability of a user answering a question again if they didn't answer it the first time: 0.029131121643
Probability of a user not answering the question again if they didn't answer it the first time: 0.970868878357
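The thread does not say how these probabilities were computed. One possible reading is: among (question, user) pairs that appear more than once and whose first label is 0, what fraction has a 1 in a later row? The sketch below implements only that reading. It assumes rows are in chronological order and use label 1 for "answered"; the toy rows and the `answered_after_first_refusal` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows in assumed chronological order: (question_id, user_id, label).
rows = [
    ("q1", "u1", 0),
    ("q1", "u1", 1),
    ("q2", "u2", 0),
    ("q2", "u2", 0),
    ("q3", "u3", 1),
    ("q4", "u4", 0),
]

def answered_after_first_refusal(rows):
    """Fraction of repeated pairs with first label 0 that later have label 1."""
    labels_by_pair = defaultdict(list)
    for q, u, label in rows:
        labels_by_pair[(q, u)].append(label)
    # Keep only pairs seen more than once whose first label is 0.
    repeats = [ls for ls in labels_by_pair.values() if len(ls) > 1 and ls[0] == 0]
    if not repeats:
        return None  # no repeated pairs, so the probability is undefined
    hits = sum(1 for ls in repeats if 1 in ls[1:])
    return hits / len(repeats)

p_answer = answered_after_first_refusal(rows)
print(p_answer, 1 - p_answer)
```

The two printed values are complements, which matches the two figures above summing to 1.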

kushaank commented 7 years ago

There isn't a case where the user answers the same question again

arpitaagrawal commented 7 years ago

Sample Data stats:

Number of users in the list, whether or not the question was answered: 27127 / 28763
Number of questions in the list, whether or not the question was answered: 7708 / 8095

Most common question, whether or not it was answered: '8cc470e1c655b5bbf6e8684509b44205' (asked 1016 times in the given sample)
Most common user: 'd66397df46f4e33cb608c322f751d884' (110 entries in the sample)
Least common user: '09d89cf0a43005b22b015b24fe8b29ad' (1 entry in the sample)
Least common question: '09698971cfdcca1b0eb9fd444edc596f' (1 entry in the sample)
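A minimal sketch of how frequency stats like these could be gathered with `collections.Counter`, again assuming (question_id, user_id, label) rows. The toy rows and `total_users` value are hypothetical placeholders, not project data.

```python
from collections import Counter

# Hypothetical toy rows: (question_id, user_id, label).
rows = [
    ("q1", "u1", 1),
    ("q1", "u2", 0),
    ("q1", "u1", 0),
    ("q2", "u1", 0),
]
total_users = 4  # hypothetical size of the full user list

# Count every appearance, regardless of the label.
question_counts = Counter(q for q, _, _ in rows)
user_counts = Counter(u for _, u, _ in rows)

most_common_question = question_counts.most_common(1)[0]   # (id, count)
least_common_user = user_counts.most_common()[-1]          # last entry has the lowest count
user_coverage = len(user_counts) / total_users              # share of users seen in the sample

print(most_common_question, least_common_user, user_coverage)
```

When several ids share the lowest count, `most_common()[-1]` returns just one of them, so a "least common" id reported this way is not necessarily unique.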

arpitaagrawal commented 7 years ago

The training sample seems to be skewed. Adding features that are derived from these labels (1/0) could increase the skew in our features.
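One skew that can be read directly off the counts at the top of this thread is the label balance: roughly 11% of labeled samples are answered, or about 8 unanswered samples for each answered one. The arithmetic, using only the numbers already posted (variable names are illustrative):

```python
# Counts copied from the first comment in this thread.
answered = 27_324
not_answered = 218_428

total = answered + not_answered            # should match the 245,752 labeled samples
positive_rate = answered / total           # share of samples with an answer
imbalance_ratio = not_answered / answered  # unanswered samples per answered sample

print(total, round(positive_rate, 4), round(imbalance_ratio, 2))
```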