Open masterpiga opened 12 years ago
I have to correct my previous post: ties in the oracle scores themselves have no effect on the final evaluation. The only ties that matter are those in the scores produced by a system. If a system assigns the same score to two sentences that have different oracle scores, then the way in which the tie is broken can have an impact on the evaluation. Conversely, if the oracle scores of the tied sentences are also tied, the way the tie is broken has no effect.
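To make this concrete, here is a minimal sketch in Python. It does not use the official scoring script: `top_half_delta` is a hypothetical helper that only measures how much better the top-ranked half is than the whole set (a one-quantile simplification of DeltaAvg), and the toy scores are invented for illustration.

```python
def top_half_delta(oracle_scores, order):
    """Mean oracle score of the top-ranked half minus the overall mean.

    `order` lists segment indices from best to worst according to a system.
    This is a simplification for illustration, not the official metric.
    """
    ranked = [oracle_scores[i] for i in order]
    half = len(ranked) // 2
    return sum(ranked[:half]) / half - sum(ranked) / len(ranked)


# Hypothetical system scores: segments 1 and 2 are tied at 3.0,
# so both orderings below are consistent with the system output.
system = [4.0, 3.0, 3.0, 1.0]
order_a = [0, 1, 2, 3]  # tie broken in favour of segment 1
order_b = [0, 2, 1, 3]  # tie broken in favour of segment 2

# Case 1: the oracle scores of the tied segments differ (4.0 vs 2.0).
oracle_diff = [5.0, 4.0, 2.0, 1.0]
diff_a = top_half_delta(oracle_diff, order_a)
diff_b = top_half_delta(oracle_diff, order_b)

# Case 2: the oracle scores of the tied segments are tied as well.
oracle_tied = [5.0, 3.0, 3.0, 1.0]
tied_a = top_half_delta(oracle_tied, order_a)
tied_b = top_half_delta(oracle_tied, order_b)

print(diff_a, diff_b)  # the two tie-breaks give different values
print(tied_a, tied_b)  # the two tie-breaks give the same value
```

In the first case the result depends on which tied segment lands in the top half; in the second case it cannot, because swapping the two segments leaves the ranked list of oracle scores unchanged.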
Hi all,
This is how I see the issues that Daniele brought up:
I hope this clarifies at least some of the issues raised.
-Radu
Hi,
I'm a bit confused by the fact that the scoring script sometimes gives me negative DeltaAvg scores. The example on the shared task website, with its use of absolute values, seems to suggest that the score should always be positive. I find it hard to follow what the Perl code is actually doing, but it is clearly not taking any absolute values. Is it possible for the DeltaAvg score to be negative? If so, how should this be interpreted?
Best,
Christian
Hi,
DeltaAvg can easily be negative. The intuition behind this number is the following: if you do a good job at ranking (good segments ranked high, bad ones ranked low), then taking some percentage of the top-ranked segments gives you a positive DeltaAvg, which says by how much, in Likert terms, the top-ranked sample is better than the overall quality (say, +0.3 better). If you do the opposite and rank good translations low and bad translations high, DeltaAvg comes back negative: the same interpretation holds, only this time the top-ranked sample is worse than the overall quality. I hope this clarifies the issue; let us know if we should drill further into this aspect.
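The sign behaviour can be sketched in a few lines of Python. This is a simplified reading of the quantile-based definition, not a port of the Perl script: the function name `delta_avg`, the use of integer division for quantile sizes, and the toy scores are all assumptions made for illustration.

```python
def delta_avg(oracle_scores, system_scores):
    """Simplified DeltaAvg-style score (illustrative, not the official script).

    For each number of quantiles n, average the mean oracle score of the
    top 1, 2, ..., n-1 quantiles and subtract the overall mean; then
    average the result over n = 2 .. len(scores) // 2.
    Needs at least 4 segments.
    """
    # Rank segments by system score, best first (sorted() is stable,
    # so tied segments keep their input order).
    order = sorted(range(len(system_scores)), key=lambda i: -system_scores[i])
    ranked = [oracle_scores[i] for i in order]
    size = len(ranked)
    overall = sum(ranked) / size

    per_n = []
    for n in range(2, size // 2 + 1):
        q = size // n  # quantile size; integer division is an assumption
        heads = [sum(ranked[:k * q]) / (k * q) for k in range(1, n)]
        per_n.append(sum(heads) / (n - 1) - overall)
    return sum(per_n) / len(per_n)


oracle = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# A system that ranks exactly like the oracle.
good = delta_avg(oracle, oracle)

# A system that ranks in exactly the opposite order.
bad = delta_avg(oracle, [-s for s in oracle])

print(good)  # 1.5
print(bad)   # -1.5
```

With these toy scores, the top-ranked samples of the good system are on average 1.5 points above the overall mean, and those of the inverted system are 1.5 points below it, which is where the negative value comes from.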
-Radu
Hi all,
I am running the evaluation script by passing it the same file twice, i.e. the ranks and scores are identical across the two files. The file that I am using for the evaluation is posted here: https://gist.github.com/1779210
And this is the output that I get:
oracle :: Ranking: (primary) DeltaAvg = 0.83 (secondary) Spearman-Corr = 1.00
oracle :: Scoring: (primary) MeanAvgErr = 0.00 (secondary) RootMeanSqrErr = 0.00 LargeErrPerc = 0.0 SmallErrPerc = 100.0
Looking at the formula for DeltaAvg, it seems that a perfect separation does not imply a DeltaAvg of 1, although the description suggests otherwise. Am I correct?
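A small worked example suggests why. It assumes the quantile-based definition as I read it from the task description (not checked against the Perl script), where V(S_{1,k}) is the mean oracle score of the top k quantiles and V(S) is the mean over the whole set; the score sets are invented for illustration.

```latex
% Assumed definition for a fixed number of quantiles n:
\[
\mathrm{DeltaAvg}[n] = \frac{1}{n-1}\sum_{k=1}^{n-1} V(S_{1,k}) - V(S)
\]

% Perfect ranking, n = 2, oracle scores {5, 5, 1, 1}:
\[
\mathrm{DeltaAvg}[2] = \frac{5+5}{2} - \frac{5+5+1+1}{4} = 5 - 3 = 2
\]

% Perfect ranking, n = 2, oracle scores {3.2, 3.1, 3.0, 2.9}:
\[
\mathrm{DeltaAvg}[2] = \frac{3.2+3.1}{2} - \frac{3.2+3.1+3.0+2.9}{4} = 3.15 - 3.05 = 0.10
\]
```

Under this reading the value reached by a perfect ranking is not normalized to 1: it depends on how spread out the oracle scores are, so the 0.83 obtained by scoring the oracle against itself would be the upper bound for this particular data set.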
Concerning ties, the instructions say that "no ties are allowed", for simplicity. But I would like to point out that the actual point labels are heavily quantized (there are 23 distinct scores in the development set). As a result, the way in which the quality scores are output AND the way in which ties are resolved can both have a major effect on the final evaluation. For example, a system that outputs quality scores with more than one decimal digit can be heavily penalized. I was wondering whether it wouldn't be appropriate to agree on common strategies to format the scores and to sort the segments that are in a tie. Or, as an alternative, decide to allow ties instead.
Thanks, cheers
Daniele