About DeltaAvg and ties

masterpiga commented 12 years ago

Hi all,

I am running the evaluation script by passing it the same file twice, i.e. the ranks and scores are identical across the two files. The file that I am using for the evaluation is posted here: https://gist.github.com/1779210

And this is the output that I get:

oracle :: Ranking: (primary) DeltaAvg = 0.83 (secondary) Spearman-Corr = 1.00
oracle :: Scoring: (primary) MeanAvgErr = 0.00 (secondary) RootMeanSqrErr = 0.00 LargeErrPerc = 0.0 SmallErrPerc = 100.0

Looking at the formula for DeltaAvg, it seems that a perfect separation does not imply a DeltaAvg of 1, although the description suggests otherwise. Am I correct?

Concerning ties, the instructions say that "no ties are allowed", for simplicity. But I would like to point out that the actual point labels are heavily quantized (there are 23 distinct scores in the development set). As a result, the way in which the quality scores are output AND the way in which ties are resolved can both have a major effect on the final evaluation. For example, a system that outputs quality scores with more than one decimal digit can be heavily penalized. I was wondering whether it wouldn't be appropriate to agree on common strategies to format the scores and to sort the segments that are in a tie. Or, as an alternative, decide to allow ties instead.

Thanks, cheers

Daniele

masterpiga commented 12 years ago

I have to correct my previous post: ties in the actual oracle scores have no effect on the final evaluation. The only ties that matter are those in the scores produced by a system. If a system produces the same score for two sentences that have different oracle scores, then the way in which the tie is broken can affect the evaluation. Conversely, the way a tie is broken has no effect if the oracle scores of the tied sentences are also tied.
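
To make the point about system-side ties concrete, here is a minimal sketch. It does not use the evaluation script; `top_k_mean` and the data are hypothetical, and the measure is just the mean oracle score of the top-ranked segments, chosen because it is easy to check by hand.

```python
def top_k_mean(oracle, order, k):
    """Mean oracle score of the first k segments in the given ranking order."""
    return sum(oracle[i] for i in order[:k]) / k


# Hypothetical data: the system gives segments 1 and 2 the same score (3.0).
system = [4.5, 3.0, 3.0, 1.0]

# Two equally valid rankings by system score; only the tie is broken differently.
order_a = [0, 1, 2, 3]
order_b = [0, 2, 1, 3]

# Case 1: the tied segments have different oracle scores, so the result changes.
oracle = [5.0, 4.0, 2.0, 1.0]
print(top_k_mean(oracle, order_a, 2))  # 4.5
print(top_k_mean(oracle, order_b, 2))  # 3.5

# Case 2: the oracle scores are tied as well, so the tie-break is irrelevant.
oracle_tied = [5.0, 3.0, 3.0, 1.0]
print(top_k_mean(oracle_tied, order_a, 2))  # 4.0
print(top_k_mean(oracle_tied, order_b, 2))  # 4.0
```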

rsoricut commented 12 years ago

Hi all,

This is how I see the issues that Daniele brought up:

1. The DeltaAvg of 0.83 seems correct to me. This metric does not guarantee that the oracle gets a score of 1 (the example we provided was just that, an example; moreover, we will rewrite the example part in the README, since it's not entirely accurate). The score does not seem low to me either, since a random ranking would get 0.0 and a not-so-great ranking usually results in scores in the 0.3 range. There is plenty of room before the ceiling is reached.
2. The "no ties allowed" rule refers to the ranking part only. The way we defined this shared task, there are two separate sub-tasks. One is the scoring sub-task, for which one can have any score and any number of ties, and it does not matter (from a MAE/RMSE perspective). Submitting scores for the scoring sub-task does not automatically qualify the submission for the ranking sub-task. For the ranking sub-task, we simply expect the submission to contain the ranks; that is, for a test set with N lines, the rankings will consist of the numbers 1...N, one number assigned to each line by the ranking algorithm (number 1 for the line deemed to be best, number N for the line deemed to be worst). So, by design, the burden of breaking possible ties in the ranking submission falls squarely on the participants.

I hope this clarifies at least some of the issues raised.

-Radu
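
One way a participant might turn scores into the required ranks 1...N is sketched below. This is an illustration only: `scores_to_ranks` is a hypothetical helper, it assumes that a higher score means a better segment, and breaking ties by input line order is an arbitrary (but reproducible) choice, not something the task prescribes.

```python
def scores_to_ranks(scores):
    """Return ranks 1..N (1 = best) for a list of quality scores.

    Assumes a higher score means better quality. Ties are broken by
    original line order, which is arbitrary but reproducible.
    """
    # sorted() is stable, so segments with equal scores keep their input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# Lines 0 and 2 are tied at 3.2; line 0 comes first, so it gets the better rank.
print(scores_to_ranks([3.2, 4.0, 3.2, 1.5]))  # [2, 1, 3, 4]
```

Every rank from 1 to N is used exactly once, so the output contains no ties regardless of how quantized the input scores are.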

chardmeier commented 12 years ago

Hi,

I'm a bit confused by the fact that the scoring script sometimes gives me negative DeltaAvg scores. The example on the shared task website, with its use of absolute values, seems to suggest that the score should always be positive. I find it hard to understand what the Perl code is actually doing, but clearly it's not taking any absolute values. Is it possible for the DeltaAvg score to be negative? If so, how should this be interpreted?

Best,
Christian

rsoricut commented 12 years ago

Hi,

The DeltaAvg can easily be negative. The intuition behind this number is the following: if you do a good job at ranking (the good segments ranked high, the bad ones ranked low), then taking some percentage of the top-ranked segments gives you a positive DeltaAvg number that says "by how much, Likert-wise, the top-ranked sample is better than the overall quality" (say, +0.3 better). But if you do the opposite, and rank good translations low and bad translations high, then the DeltaAvg number will come back negative, because the same interpretation holds (only this time the top-ranked sample is worse, so the DeltaAvg will appear as negative). I hope this clarifies the issue; let us know if we should drill further into this aspect.

-Radu
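
The sign behaviour described above can be reproduced with a small sketch. Note that this is not the script's DeltaAvg computation: `top_fraction_delta` is a hypothetical function that computes only a single "top-ranked sample versus overall quality" difference, on made-up data, to show why a reversed ranking yields a negative value.

```python
def top_fraction_delta(oracle, ranks, fraction=0.5):
    """Mean oracle score of the top-ranked fraction minus the overall mean."""
    n = len(oracle)
    k = max(1, int(n * fraction))
    # Order segment indices so that rank 1 (deemed best) comes first.
    order = sorted(range(n), key=lambda i: ranks[i])
    top_mean = sum(oracle[i] for i in order[:k]) / k
    overall_mean = sum(oracle) / n
    return top_mean - overall_mean


oracle = [5.0, 4.0, 2.0, 1.0]   # hypothetical Likert-style oracle scores
good_ranks = [1, 2, 3, 4]       # good segments ranked high
bad_ranks = [4, 3, 2, 1]        # good segments ranked low

print(top_fraction_delta(oracle, good_ranks))  # 1.5
print(top_fraction_delta(oracle, bad_ranks))   # -1.5
```

No absolute value is taken anywhere, so the sign directly tells you whether the top-ranked sample is better or worse than the set as a whole.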
