mitodl / edx2bigquery

Tool to convert & load data from edX platform into BigQuery
GNU General Public License v2.0
29 stars 29 forks

Fix median_dt, add corr filter, fix false pos sybils #25

Closed CGNx closed 9 years ago

CGNx commented 9 years ago

1) Fixed the median_dt computation, which was dropping people because median_dt was null whenever percent_show_ans_before < 50%. (I used your example blindly before, and it turned out I wasn't avoiding nulls at all. Since you were computing the overall median and I was computing the median for each cameo/shadow partition, a different strategy was needed.) If you ever need to compute a median that avoids null values within partitions, this code may be useful to look at.
2) Added a corr > 0 filter (there was no correlation filtering before).
3) Pairs with no problems in common are no longer included in show_ans_before.
4) Fixed the false positives caused by incorrectly including all non-certified users instead of only allowing harvesters whose matched cameo is in the table and doesn't get filtered out.
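
For anyone reading this later, here is a rough sketch in plain Python of the idea behind item 1: take the median separately for each cameo/shadow pair, and drop nulls before taking it so that a pair with mostly-null values still gets a usable median_dt. This is only an illustration and not the code in this PR. The function name `median_dt_by_pair` and the `(cameo, shadow, dt)` row layout are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import median


def median_dt_by_pair(rows):
    """Median dt per (cameo, shadow) pair, ignoring null (None) dt values.

    `rows` is an iterable of (cameo, shadow, dt) tuples; this layout is a
    hypothetical stand-in for the real table. A pair whose dt values are
    all None is kept in the result with a median of None rather than
    being silently dropped.
    """
    groups = defaultdict(list)
    for cameo, shadow, dt in rows:
        pair_values = groups[(cameo, shadow)]  # registers the pair even if dt is None
        if dt is not None:
            pair_values.append(dt)
    return {
        pair: (median(values) if values else None)
        for pair, values in groups.items()
    }


# Made-up data: pair (a, b) has more nulls than values, pair (c, d) has only nulls.
example_rows = [
    ("a", "b", None),
    ("a", "b", None),
    ("a", "b", None),
    ("a", "b", 10.0),
    ("a", "b", 30.0),
    ("c", "d", None),
]
example_result = median_dt_by_pair(example_rows)
```

With the nulls removed first, pair (a, b) gets the median of its two real values even though fewer than half of its rows have one, which is the case that previously came out null.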