ipeirotis / Get-Another-Label

Quality control code for estimating the quality of the workers in crowdsourcing environments

Add "dry-run" option for evaluation/testing #2

Closed ipeirotis closed 12 years ago

ipeirotis commented 12 years ago

It would be nice to add an option to pass a file with the correct labels for (some of) the objects.

Currently, we also allow loading "gold data" with the purpose of aiding and speeding up the estimation of worker quality. The evaluation data will not be used in the same way we currently use gold data: specifically, the evaluation data will never be used during training. Instead, we will use the evaluation data to estimate the quality of the estimates generated by the algorithm.

How it should work

  1. We load another file, say evaluation-data, which will have a similar format to the correctfile (one objectid, label pair per line)
  2. We store the evaluation label in the Datum object as a new field.
  3. After running the algorithm, when we print the final results, we also compute the following:
     a. The labels for the objects as computed by the algorithm, compared to the evaluation data
     b. The quality of the workers as computed by the algorithm, compared to the evaluation data
     c. The priors for the categories as computed by the algorithm, compared to the evaluation data

Labels for the objects

Right now, we create the files:

- dawid-skene-results.txt
- naive-majority-vote.txt
- differences-with-majority-vote.txt
- object-probabilities.txt

For each of these files, we should also add a column with the correct label, from the evaluation data.

We should also add columns with the classification cost for each example.

The classification cost is computed as the cost of misclassifying an object of class A (taken from the evaluation data) into class B (taken from the label(s) assigned by the workers). The cost is based on the costs file. The classification cost can be computed in multiple ways:

a. Using the maximum-likelihood category from DawidSkene (EvalCost_DS_ML)
b. Using the maximum-likelihood category from Majority (EvalCost_MV_ML)
c. Using the "soft label" category from DawidSkene (EvalCost_DS_Soft)
d. Using the "soft label" category from Majority (EvalCost_MV_Soft)

For cases a and b, we simply take the class (as reported in dawid-skene-results.txt and naive-majority-vote.txt) and print the cost. For cases c and d, we use the "soft label" for the object and compute the weighted cost. For example, if an object has evaluation class A, the algorithm returns 60% A, 30% B, 10% C, and the costs are A->A = 0, A->B = 1, A->C = 2, then the classification cost is 0.6×0 + 0.3×1 + 0.1×2 = 0.5. For an object with 90% A, 9% B, 1% C the classification cost is 0.9×0 + 0.09×1 + 0.01×2 = 0.11.
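The soft-label cost computation above can be sketched in Java; this is an illustration only, and names like `expectedCost` (and the use of plain `Map`s for the soft label and the costs file) are assumptions, not the project's actual API:

```java
import java.util.Map;

public class EvalCost {

    // Expected misclassification cost of a soft label, given the true
    // (evaluation) class. costs.get(trueClass).get(c) is the cost of
    // misclassifying an object of trueClass as c, read from the costs file.
    static double expectedCost(String trueClass,
                               Map<String, Double> softLabel,
                               Map<String, Map<String, Double>> costs) {
        double cost = 0.0;
        for (Map.Entry<String, Double> e : softLabel.entrySet()) {
            cost += e.getValue() * costs.get(trueClass).get(e.getKey());
        }
        return cost;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> costs =
            Map.of("A", Map.of("A", 0.0, "B", 1.0, "C", 2.0));
        Map<String, Double> soft = Map.of("A", 0.6, "B", 0.3, "C", 0.1);
        // Expected cost: 0.6*0 + 0.3*1 + 0.1*2 = 0.5
        System.out.println(expectedCost("A", soft, costs));
    }
}
```

For the maximum-likelihood variants (cases a and b), the soft label degenerates to probability 1 on the single reported class, so the same function applies.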

We should also generate an object-label-accuracy.txt report that will contain:

a. The confusion matrix of each technique (dawid-skene maxlikelihood, dawid-skene soft, majority maxlikelihood, majority soft)
b. The average misclassification cost of each algorithm.

The quality of the workers

We should create an extra confusion matrix for each worker, which should be based solely on the assigned labels and the actual evaluation data. Then we can list the estimated quality of the worker based on the evaluation data, next to the estimates that we have for the confusion matrix, the quality of the worker, etc.

We should modify these files accordingly:

- worker-statistics-detailed.txt
- worker-statistics-summary.txt

To compute the evaluation-data confusion matrix of a worker, we go through the objects labeled by this worker; for each object, we check what the evaluation label is and what label the worker assigned. Based on these, we compute the confusion matrix of the worker, the quality (expected and optimized) of the worker, etc.
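The per-worker tally described above could look like the following sketch; the class and method names and the `Map`-based representation are hypothetical, not taken from the repository:

```java
import java.util.HashMap;
import java.util.Map;

public class WorkerEval {

    // Build a worker's confusion matrix against the evaluation data.
    // evalLabels maps objectid -> evaluation (true) label;
    // workerLabels maps objectid -> label assigned by this worker.
    // The result maps trueLabel -> (assignedLabel -> count); objects
    // without an evaluation label are skipped.
    static Map<String, Map<String, Integer>> confusionMatrix(
            Map<String, String> evalLabels,
            Map<String, String> workerLabels) {
        Map<String, Map<String, Integer>> matrix = new HashMap<>();
        for (Map.Entry<String, String> e : workerLabels.entrySet()) {
            String truth = evalLabels.get(e.getKey());
            if (truth == null) continue; // object not in the evaluation data
            matrix.computeIfAbsent(truth, k -> new HashMap<>())
                  .merge(e.getValue(), 1, Integer::sum);
        }
        return matrix;
    }
}
```

Normalizing each row of this matrix gives the worker's empirical error rates, from which the expected and optimized quality scores can be derived the same way they are derived from the estimated confusion matrix.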

The priors for the categories

That is the simplest part. We should list in the priors.txt file not only the estimated priors, but also the actual priors, based on the prevalence of each category in the evaluation data.
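Computing the actual priors is just counting label prevalence; a minimal sketch (illustrative names, not the project's API):

```java
import java.util.HashMap;
import java.util.Map;

public class EvalPriors {

    // Actual category priors, estimated as the fraction of evaluation
    // objects carrying each label. evalLabels maps objectid -> label.
    static Map<String, Double> priors(Map<String, String> evalLabels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : evalLabels.values()) {
            counts.merge(label, 1, Integer::sum);
        }
        double n = evalLabels.size();
        Map<String, Double> priors = new HashMap<>();
        counts.forEach((label, c) -> priors.put(label, c / n));
        return priors;
    }
}
```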

ipeirotis commented 12 years ago

Most of the tasks above are now done. What remains to be done: