mitodl / edx2bigquery

Tool to convert & load data from edX platform into BigQuery
GNU General Public License v2.0

Created unit testing for analysis and bigquery2pandas #54

Open CGNx opened 8 years ago

CGNx commented 8 years ago

Unit testing works by comparing the previous run of a given analysis with the current run in a single BigQuery query: the query appends the latest analysis run, compares it with the previous one, and appends the difference to the final unit test table. The test courses are kept private.
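To illustrate the comparison step, here is a minimal Python sketch of an average percentage change between two runs. The actual tests do this inside a single BigQuery query; the function name, the column names, and the zero-handling rule below are hypothetical and only meant to show the idea.

```python
def mean_percent_change(previous_row, current_row, columns):
    """Average absolute percent change across the sampled columns."""
    changes = []
    for col in columns:
        prev, curr = previous_row[col], current_row[col]
        if prev == 0:
            # Assumed rule: no change if both are zero, otherwise count as 100%.
            changes.append(0.0 if curr == 0 else 100.0)
        else:
            changes.append(100.0 * abs(curr - prev) / abs(prev))
    return sum(changes) / len(changes)


# Hypothetical summary values from a previous and a current analysis run.
previous_run = {"n_pairs": 200, "avg_score": 0.50}
current_run = {"n_pairs": 210, "avg_score": 0.50}
result = mean_percent_change(previous_run, current_run, ["n_pairs", "avg_score"])
```

A result near zero suggests the code change did not alter the analysis output for the sampled columns; a large value flags a change worth inspecting.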

bigquery2pandas is a library for interacting with BigQuery using pandas. SQL2df is the most frequently used function; it creates a correctly typed, correctly ordered pandas DataFrame from a SQL query. Estimated time to completion and other useful features are supported.

CGNx commented 8 years ago

HOW TO RUN ANALYSIS TESTS

The tests report the average percentage change for a sample of columns across five test courses whenever a change is made. Here is driver code to run the tests:

from edx2bigquery.edx2bigquery.bigquery2pandas import analysis_unit_tests

test_course_ids = analysis_unit_tests.fetch_test_course_ids()

update_msg = "Whatever the most recent update to the code is - keep it short, this will be added to the table"
analysis_unit_tests.ans_coupling_test1('dataset', test_course_ids=test_course_ids, what_changed=update_msg)
analysis_unit_tests.sab_test1("dataset", test_course_ids=test_course_ids, what_changed=update_msg)
analysis_unit_tests.cameo_test1("dataset", test_course_ids=test_course_ids, what_changed=update_msg)
print('Done')

WILSON'S INTERVAL FOR RANKING CAMEO CHEATING AND COLLABORATION

CAMEO - show_ans_before
Collaboration - ans_coupling

The Wilson's Interval score provides a single value that ranks master-harvester pairs. This score combines a negative and a positive score for each student into one confidence-based measure.

The interpretability of the ranking is based on the features used to compute the Wilson's Interval score. In the "show_ans_before" case, the score ranks user pairs based on their likelihood of copying via CAMEO. In the "ans_coupling" case, the score ranks user pairs based on their likelihood of answering problems together in pairs or groups (whether by copying or working together).

IMPORTANT: The positive and negative scores are generated by first normalizing the features, then combining them linearly with weights. How are these weights computed? A boosted logistic regression classifier with regularization (with cross-validation to find parameters) is trained on 1 million randomly sampled master-harvester pairs. CAMEO cheating labels, found using a hand-tuned composite of five filtering algorithms, are used as binary labels. The training set uses the same features as those comprising the negative and positive scores. Features are standardized using min-max scaling.

This process is repeated 1000 times. The trained weights are also standardized at each iteration, and then all 1000 trained, standardized weights are averaged to produce the final weights. These weights represent the predictive power of each of the features and are used in the linear combination for the positive and negative scores. The positive and negative scores are combined using Wilson's interval to produce the final CAMEO ranking.

Since the weights are trained on CAMEO labels, not collaboration labels, the Wilson's Interval ranking is optimized for "show_ans_before", not "ans_coupling". However, the two tables are nearly identical in structure and differ only in semantics, which makes the Wilson's Interval ranking highly relevant to "compute_ans_coupling".
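The scaling, weight-averaging, and linear-combination steps described above can be sketched as follows. This is only an illustration under stated assumptions: the training of the classifier is omitted, the description does not say how weights are standardized at each iteration (dividing by the sum of absolute values is assumed here), and all function names and example numbers are hypothetical.

```python
def min_max_scale(values):
    """Rescale one feature column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]


def average_weights(weight_runs):
    """Standardize each run's weights, then average them across runs.

    Assumption: "standardized" means dividing by the sum of absolute values.
    """
    n_features = len(weight_runs[0])
    totals = [0.0] * n_features
    for weights in weight_runs:
        norm = float(sum(abs(w) for w in weights)) or 1.0
        for i, w in enumerate(weights):
            totals[i] += w / norm
    return [t / len(weight_runs) for t in totals]


def linear_score(features, weights):
    """Weighted linear combination of already-scaled features."""
    return sum(f * w for f, w in zip(features, weights))


# Two hypothetical training runs instead of 1000, with three features each.
weight_runs = [[2.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
final_weights = average_weights(weight_runs)
scaled_column = min_max_scale([10, 20, 30])
score = linear_score([1.0, 0.0, 1.0], final_weights)
```

Averaging over many runs smooths out the variation that comes from training on different random samples of pairs.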

The Wilson's Interval score is used to sort these analysis tables. The top row in the table therefore represents the most statistically significant pair of users in the table, with respect to whichever metric the table captures.
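As a rough illustration of the sorting step, the sketch below uses the standard lower bound of the Wilson score interval and sorts pairs by it. How the positive and negative scores actually enter the formula is not spelled out here; treating them as pseudo-counts of positive and negative evidence is an assumption, and the pair names and values are made up.

```python
import math


def wilson_lower_bound(positive, negative, z=1.96):
    """Lower bound of the Wilson score interval (z=1.96 is roughly 95%)."""
    n = positive + negative
    if n == 0:
        return 0.0
    p_hat = positive / float(n)
    z2 = z * z
    centre = p_hat + z2 / (2.0 * n)
    margin = z * math.sqrt((p_hat * (1.0 - p_hat) + z2 / (4.0 * n)) / n)
    return (centre - margin) / (1.0 + z2 / n)


# Hypothetical (pair, positive score, negative score) rows.
pairs = [("a", 9, 1), ("b", 90, 10), ("c", 1, 9)]
ranked = [
    name
    for name, pos, neg in sorted(
        pairs, key=lambda row: wilson_lower_bound(row[1], row[2]), reverse=True
    )
]
```

Pairs "a" and "b" have the same positive fraction, but "b" ranks higher because it is backed by more evidence, which is the confidence-based behaviour the ranking relies on.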