Closed foxik closed 8 years ago
The logarithmic weights sound appealing to me, but we should use the sizes of the test data only, right? Not all corpora are split into 80-10-10 parts, so the size of the entire corpus may not give the weight we want.
Perhaps it would be better (and simpler) to add a "small test data track" as suggested by @ftyers in #2, and use a normal arithmetic mean of the scores; this would be the main overall system score in the shared task. I would leave training data as is (if there are a million training sentences, you can use them) but test data would be limited to roughly N syntactic words: if the test set is larger than N, sentences are drawn randomly until the number of tokens exceeds N. I guess we do not necessarily have to take N equal to the smallest corpus (Kazakh in 1.3, with 587 test words). The next size in 1.3 was Tamil with 1989 test words. So even if we set N=1000 or 1500, the influence of the smaller test sets should be sufficiently suppressed.
A clarification: by "drawing sentences randomly" I mean a one-time process done by us before the shared task. Then the small test data would be fixed of course.
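For concreteness, the one-time draw could be sketched like this (a hypothetical helper: the function name, the sentence representation as token lists, and the fixed seed are all assumptions, not part of the proposal):

```python
import random

def sample_test_subset(sentences, n_tokens, seed=42):
    """Draw whole sentences at random until the token budget is reached.

    `sentences` is a list of sentences, each a list of tokens. The draw
    is done once, with a fixed seed, so the subset is then frozen.
    """
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)
    subset, total = [], 0
    for sentence in pool:
        subset.append(sentence)
        total += len(sentence)
        if total >= n_tokens:  # stop once the budget is exceeded
            break
    return subset
```

The last drawn sentence is kept whole, so the subset slightly overshoots N rather than cutting a sentence in half.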
[EDIT: As noted below, the test data size is not the problem, so this whole comment is of no avail.]
In theory, we could even use the limit on the test data size in the main track, even with N=5k. That would make the difference between small and large corpora much smaller (and it seems better than using a logarithmic weight). Also, the "test subsets" would not be released before system submission, which would make the test set "less known" and harder to optimize against.
For information, here are sorted word sizes of UD 1.3 test data:
587 1989 2951 3821 3985
4105 4125 4235 4562 4832
5079 5158 5668 5843 5884
6262 6548 7018 7185 7953
8228 8481 8616 9140 9573
9591 10862 10952 11780 12012
12125 14063 14906 15734 16022
16268 16286 18375 18502 20377
23670 24374 25096 25251 28268
29438 29746 30034 35430 53594
59503 77688 107737 173920
N=5k seems quite reasonable.
I personally do not think it's the test set size which matters here (as long as it is at least a little bit reasonable). Your score on a test set with 5000 words won't differ that much from your score on a test set with 500000 words. I think it is the training size and the overall "parseability" of the language which matters.
Since we are still in the design phase, let me propose working with ranks, just for consideration. If you are first on some language, you get one point; if you are second, you get two points; etc. The system with the fewest points wins. That way, small languages with scores in the thirties won't pull the overall number down, and we would be scoring how good your system is relative to the others.
[Edit:] this of course works best if everyone is forced to submit all languages, but I believe that was our intention.
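A sketch of the rank-based scoring, assuming a hypothetical `scores_by_system` mapping and that every system submits every language; ties are not handled:

```python
def rank_scores(scores_by_system):
    """scores_by_system: {system: {language: score}}; higher score is better.

    Returns total rank points per system; the lowest total wins.
    """
    systems = list(scores_by_system)
    languages = next(iter(scores_by_system.values())).keys()
    points = {s: 0 for s in systems}
    for lang in languages:
        # best score on this language gets rank 1, next gets rank 2, ...
        ordered = sorted(systems, key=lambda s: scores_by_system[s][lang],
                         reverse=True)
        for rank, system in enumerate(ordered, start=1):
            points[system] += rank
    return points
```

As noted above, the absolute scores drop out entirely, so a language where everyone scores in the thirties counts the same as one where everyone scores in the nineties.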
Oh -- you are right, somehow I went from "because test size is small, even 5 words are a big percentage" to "small test sizes will solve our problem". Thanks :-) (@dan-zeman Therefore, I think that the logarithmic weights should be computed from the size of the training data.)
As for the ranks -- we thought about it too, but the issue with ranks is that they are relative, and the result depends heavily on the group of participants. Hopefully, people will keep working on the problem even after the shared task, and then it would be very complicated to compare two new systems.
Therefore, I believe some absolute score would be better. The currently proposed variants are an unweighted arithmetic mean and an arithmetic mean weighted by log(training size). (Actually, the weight might be something like 1 + log(training size / smallest training size), because if we use plain log(training size), the quotient of log(largest training size = 1173282) and log(smallest training size = 3973) for UD 1.3 is only about 1.68, independently of the base. For the 1 + log... formula the logarithm base does matter: the quotient is 3.47 for base 10, 6.68 for base e, and 9.2 for base 2. But then again, the 1 + log... weights depend on the size of the smallest corpus, so they would not be comparable if the smallest corpus changes.) Personally, I am inclined to use just the unweighted arithmetic mean.
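The quotients above can be checked in a few lines (the `weight` helper is hypothetical; 1173282 and 3973 are the largest and smallest UD 1.3 training sizes quoted above, and the results agree with the quoted quotients up to rounding):

```python
import math

def weight(size, smallest, base):
    # hypothetical weight: 1 + log_base(size / smallest training size);
    # note weight(smallest, smallest, base) == 1 for any base
    return 1 + math.log(size / smallest, base)

largest, smallest = 1173282, 3973

# plain log: the quotient is the same for every base, roughly 1.69
plain_quotient = math.log(largest) / math.log(smallest)

# 1 + log(size/smallest): the base matters
# base 10 -> ~3.47, base e -> ~6.69, base 2 -> ~9.21
quotients = {base: weight(largest, smallest, base)
             for base in (10, math.e, 2)}
```

Because `weight(smallest, ...)` is always 1, the quotient of largest to smallest weight is just the weight of the largest corpus itself.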
OK w.r.t. the ranks. You're right about the need to also compare systems after the fact. For me, the macro score (unweighted mean) is better than the micro score, but the log-weighted one has its appeal of at least attempting to position itself between the two extremes. The unweighted mean does carry the risk of the tiny low-score languages overpowering the ranking.
Yes, with the unweighted mean it probably makes sense to create a multilingual system that can utilize training data from multiple languages, and to choose some "similar" corpora for every small corpus to train on. (This approach makes sense in any case, but with the unweighted mean it will most likely result in a measurable improvement.)
Correct me if I am wrong but I believe the outcome of the Berlin meeting is that we will use unweighted macro-average. (We may still fiddle with weights and publish it as additional statistics but that will not go into the task proposal.) Tentatively closing this issue.
The original proposal computes the overall score as an arithmetic mean of individual corpus scores. That would probably motivate the participants to spend a lot of time tuning the performance on small corpora (where even 10 words can be more than 1% of the test set) -- we were even worried whether people would manually annotate more data in that language.
Currently, as discussed in #2, we propose to leave out too small corpora, which is partly motivated by this issue.
There are other ways to compute the overall score. For one, we could use the F1 score computed from the words of all corpora pooled together (i.e., analogously to micro-accuracy). However, in this case the overall score would be determined by performance on the 5-10 biggest corpora.
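To illustrate the distinction with hypothetical numbers: macro-accuracy averages per-corpus accuracies, so every corpus counts equally, while micro-accuracy pools all words, so one large corpus dominates it.

```python
def macro_accuracy(per_corpus):
    # per_corpus: list of (correct, total) word counts, one pair per corpus;
    # each corpus contributes equally, regardless of its size
    return sum(c / t for c, t in per_corpus) / len(per_corpus)

def micro_accuracy(per_corpus):
    # pool all words first; large corpora dominate the result
    correct = sum(c for c, _ in per_corpus)
    total = sum(t for _, t in per_corpus)
    return correct / total

# a tiny corpus at 50% accuracy next to a huge one at 95% (made-up numbers):
# macro gives 0.725, micro gives ~0.947
corpora = [(300, 600), (95000, 100000)]
```

The same pooling argument applies to F1: a pooled (micro) F1 over all corpora would effectively ignore the small test sets.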
There are other, more complex ways to compute the overall score, but to keep the proposal simple, we chose between the two possibilities described above (and selected the macro-accuracy analogue).
However, maybe we could use some more complex way of computing the overall score. For example, we could compute a weighted arithmetic mean, using the logarithms of corpus sizes as the weights.
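A minimal sketch of that log-weighted mean (the function name and input format are assumptions):

```python
import math

def log_weighted_mean(corpus_results):
    """corpus_results: list of (score, training_size_in_words) pairs.

    Weighted arithmetic mean with log(training size) as the weight.
    The logarithm base does not matter here: changing it rescales every
    weight by the same constant, which cancels out of the ratio.
    """
    weights = [math.log(size) for _, size in corpus_results]
    scores = [score for score, _ in corpus_results]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

With the UD 1.3 extremes (training sizes 3973 and 1173282) and made-up scores of 70 and 85, this lands between the unweighted mean (77.5) and the big-corpus score, i.e. it sits between the macro and micro extremes as intended.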