ufal / conll2017

CoNLL 2017 Shared Task Proposal: UD End-to-End parsing

How to compute the overall score #3

Closed: foxik closed this issue 8 years ago

foxik commented 8 years ago

The original proposal computes the overall score as an arithmetic mean of the individual corpus scores. That would probably motivate participants to spend a lot of time tuning performance on small corpora (where even 10 words can be more than 1% of the test set) -- we were even worried that people might manually annotate additional data for those languages.

Currently, as discussed in #2, we propose to leave out too small corpora, which is partly motivated by this issue.

There are other possibilities for how we could compute the overall score -- for one, we could use the F1 score computed over the words from all corpora pooled together (i.e., analogously to micro-accuracy). However, in that case the overall score would be determined by the performance on the biggest 5-10 corpora.

There are additional, more complex ways to compute the overall score, but to keep the proposal simple, we chose between the two possibilities described above (and selected the macro-accuracy analogue).
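
For concreteness, here is a minimal Python sketch of the two aggregation options (the names and data layout are illustrative; a true micro-F1 would pool the underlying precision/recall counts rather than averaging the per-corpus scores):

```python
def macro_score(corpus_scores):
    # Unweighted arithmetic mean of per-corpus scores (the selected option).
    return sum(corpus_scores) / len(corpus_scores)

def micro_score(corpus_scores, corpus_words):
    # Word-weighted mean, approximating a score computed over all words
    # pooled together; the largest corpora dominate the result.
    total = sum(corpus_words)
    return sum(s * w for s, w in zip(corpus_scores, corpus_words)) / total
```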

However, maybe we could use some more complex way of computing the overall score -- for example, we could compute a weighted arithmetic mean, using the logarithms of the corpus sizes as the weights.
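
A minimal sketch of that weighted variant, assuming per-corpus scores and sizes (in words) are available (names are illustrative):

```python
import math

def log_weighted_score(corpus_scores, corpus_sizes):
    # Arithmetic mean weighted by the logarithm of each corpus size.
    # Note the logarithm base cancels out here, since changing it scales
    # all weights by the same constant factor.
    weights = [math.log(n) for n in corpus_sizes]
    return sum(w * s for w, s in zip(weights, corpus_scores)) / sum(weights)
```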

dan-zeman commented 8 years ago

The logarithmic weights sound appealing to me, but we should use the sizes of the test data only, right? Not all corpora are split into 80-10-10 parts, so the size of the entire corpus may not give the weight we want.

Perhaps it would be better (and simpler) to add a "small test data track" as suggested by @ftyers in #2, and use a normal arithmetic mean of the scores; this would be the main overall system score in the shared task. I would leave the training data as is (if there are a million training sentences, you can use them), but the test data would be limited to roughly N syntactic words: if a test set is larger than N, sentences are drawn randomly until the number of tokens exceeds N. I guess we do not necessarily have to set N equal to the smallest corpus (Kazakh in 1.3, with 587 test words); the next size in 1.3 was Tamil with 1989 test words. So even if we set N=1000 or 1500, the influence of the smaller test sets should be sufficiently suppressed.
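
A sketch of that one-time draw (assuming sentences are given as lists of tokens; the fixed seed is only there to make the draw reproducible):

```python
import random

def sample_test_subset(sentences, n_words, seed=2017):
    # One-time random draw: shuffle the sentence order, then take sentences
    # in the shuffled order until the token count exceeds n_words.
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)
    subset, tokens = [], 0
    for i in order:
        subset.append(sentences[i])
        tokens += len(sentences[i])
        if tokens > n_words:
            break
    return subset
```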

dan-zeman commented 8 years ago

A clarification: by "drawing sentences randomly" I mean a one-time process done by us before the shared task. Then the small test data would be fixed of course.

foxik commented 8 years ago

[EDIT: As noted below, the test data size is not the problem, so this whole comment is moot.]

In theory, we could even apply the limit on the test data size in the main track, say with N=5k. That would make the difference between small and large corpora much smaller (and it seems better than using logarithmic weights). Also, the "test subsets" would not be released before system submission, which would make the test set "less known" and "harder to optimize against".

For reference, here are the sorted word counts of the UD 1.3 test sets:

587    1989   2951    3821    3985
4105   4125   4235    4562    4832
5079   5158   5668    5843    5884
6262   6548   7018    7185    7953
8228   8481   8616    9140    9573
9591   10862  10952   11780   12012
12125  14063  14906   15734   16022
16268  16286  18375   18502   20377
23670  24374  25096   25251   28268
29438  29746  30034   35430   53594
59503  77688  107737  173920 

N=5k seems quite reasonable.

fginter commented 8 years ago

I personally do not think it's the test set size that matters here (as long as it is at least somewhat reasonable). Your score on a test set with 5000 words won't differ that much from your score on a test set with 500000 words. I think it is the training size and the overall "parseability" of the language that matter.

Since we are still in the design phase, let me propose, just for consideration, working only with the ranks. If you are first on some language, you get one point; if you are second, you get two points; and so on. The system with the fewest points wins. That way, small languages with scores in the thirties won't pull the overall number down, and we would be scoring how good your system is relative to the others (see the sketch below).

[Edit:] this of course works best if everyone is forced to submit all languages, but I believe that was our intention.
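
A sketch of that rank-based scoring (the data layout is hypothetical: a per-language mapping of system scores; ties would still need a convention, e.g. shared ranks):

```python
from collections import defaultdict

def rank_points(scores_by_language):
    # Each system collects its rank (1 = best score) on every language;
    # the system with the fewest total points wins. Assumes every system
    # submitted results for every language.
    points = defaultdict(int)
    for scores in scores_by_language.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, system in enumerate(ranked, start=1):
            points[system] += rank
    return dict(points)
```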

foxik commented 8 years ago

Oh -- you are right, somehow I went from "because test size is small, even 5 words are a big percentage" to "small test sizes will solve our problem". Thanks :-) (@dan-zeman Therefore, I think that the logarithmic weights should be computed from the size of the training data.)

As for the ranks -- we thought about them too, but the issue is that ranks are relative, and the result depends heavily on the set of participating systems. Hopefully people will keep working on the problem even after the shared task, and then it would be very complicated to compare two new systems.

Therefore, I believe some absolute score would be better. The currently proposed variants are the unweighted arithmetic mean and the arithmetic mean weighted by log(training size). (Actually, the weight might have to be something like 1 + log(training size / smallest training size), because if we use plain log(training size), the quotient of log(largest training size = 1173282) and log(smallest training size = 3973) for UD 1.3 is only 1.68, independently of the base. For the 1+log... formula the logarithm base does matter: the quotient is 3.47 for base 10, 6.68 for base e, and 9.2 for base 2. On the other hand, the 1+log... weights depend on the size of the smallest corpus, so they would not be comparable if the smallest corpus changed.) Personally, I am inclined to use just the unweighted arithmetic mean.
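
To make the quoted quotients easy to verify, a small sketch with the UD 1.3 extremes hard-coded for illustration:

```python
import math

def weight(train_size, smallest, base):
    # The proposed 1 + log_base(train_size / smallest_train_size) weight;
    # the smallest corpus always gets weight 1, so the largest corpus's
    # weight is directly the largest-to-smallest quotient.
    return 1 + math.log(train_size / smallest, base)

# Quotients for UD 1.3 (largest 1173282 vs smallest 3973 training words):
for base in (10, math.e, 2):
    print(base, weight(1173282, 3973, base))  # roughly 3.5, 6.7 and 9.2
```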

fginter commented 8 years ago

OK w.r.t. the ranks -- you're right about the need to compare systems after the fact as well. For me, the macro score (unweighted mean) is better than the micro score, but the log variant has its appeal in at least attempting to position itself between the two extremes. The unweighted mean does carry the risk of tiny low-score languages overpowering the ranking.

foxik commented 8 years ago

Yes, with the unweighted mean it probably makes sense to build a multilingual system that can utilize training data from multiple languages, and to choose some "similar" corpora for every small corpus to train on. (This approach makes sense in any case, but with the unweighted mean it will most likely yield a measurable improvement.)

dan-zeman commented 8 years ago

Correct me if I am wrong, but I believe the outcome of the Berlin meeting is that we will use the unweighted macro-average. (We may still fiddle with weights and publish them as additional statistics, but that will not go into the task proposal.) Tentatively closing this issue.