issues and problems with the f1 metrics for evaluating word-level quality estimation

chrishokamp commented 9 years ago

Put thoughts on metrics into this issue:

After the discussion on 18.2.15:

- The basic per-class F1 metric only makes sense when we take all of the classes into account.
  - If we don't consider all of the classes, it's easy to game the system by labeling all tokens as 'GOOD' or 'BAD', or by optimizing a metric that doesn't reflect overall performance.
- We have the intuition that the 'BAD' class is somehow more important for word-level QE, perhaps because it is rarer, but this intuition is not well motivated.
- It's difficult to use NER metrics because we currently cannot group errors together above the word level. In other words, we don't know where one error 'begins' and another 'ends'. A different dataset might solve this problem.
- Aligning the tagged strings with the reference is the same as measuring tagging accuracy, and this seems to be a satisfactory metric when the class frequencies are not too different.
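
To make the gaming problem concrete, here is a minimal sketch on made-up toy data (the tag sequences and the helper `per_class_f1` are illustrative assumptions, not marmot code). A tagger that labels everything 'GOOD' looks fine on accuracy and on the 'GOOD' F1, and only the 'BAD' F1 exposes it.

```python
def per_class_f1(gold, pred, labels=("GOOD", "BAD")):
    """Return {label: F1} from token-level gold and predicted tags."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data, invented for illustration: 8 GOOD tokens and 2 BAD tokens.
gold = ["GOOD"] * 8 + ["BAD"] * 2
all_good = ["GOOD"] * 10  # degenerate tagger: everything is GOOD

scores = per_class_f1(gold, all_good)
accuracy = sum(g == p for g, p in zip(gold, all_good)) / len(gold)

print(scores)    # GOOD F1 is high, BAD F1 is 0
print(accuracy)  # accuracy also looks acceptable
```

Reporting only the 'GOOD' F1 or only accuracy would hide the fact that this tagger never finds an error, which is why all classes need to be looked at together.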

chrishokamp commented 9 years ago

A metric similar to the BLEU score could make sense: it would measure the overlap between spans in the hypothesis and the reference. The key idea is that a span that is only a partial match is not discarded, but its score is penalized.

varvara-l commented 9 years ago

Each span can be used only once, so sequences of all-good or all-bad labels don't get a high score.
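
One possible reading of this idea, as a rough sketch only: treat each maximal run of identical labels as a span, score a hypothesis span against a reference span of the same label by overlap divided by union (so partial matches count but are penalized), and let each reference span be matched at most once. The function names, the greedy left-to-right matching, and the normalization by the larger span count are all assumptions made for illustration; the thread does not define the metric at this level of detail.

```python
from itertools import groupby


def spans(tags):
    """Collapse a tag sequence into maximal runs: (label, start, end_exclusive)."""
    out, pos = [], 0
    for label, run in groupby(tags):
        length = len(list(run))
        out.append((label, pos, pos + length))
        pos += length
    return out


def span_overlap_score(ref_tags, hyp_tags):
    """Greedy span-overlap score in [0, 1]; each reference span is used once."""
    if not ref_tags or len(ref_tags) != len(hyp_tags):
        raise ValueError("sequences must be non-empty and of equal length")
    ref, hyp = spans(ref_tags), spans(hyp_tags)
    unused = set(range(len(ref)))
    total = 0.0
    for label, h_start, h_end in hyp:
        best, best_idx = 0.0, None
        for i in sorted(unused):
            r_label, r_start, r_end = ref[i]
            if r_label != label:
                continue
            overlap = min(h_end, r_end) - max(h_start, r_start)
            if overlap <= 0:
                continue
            union = max(h_end, r_end) - min(h_start, r_start)
            score = overlap / union  # partial matches are kept but penalized
            if score > best:
                best, best_idx = score, i
        if best_idx is not None:
            unused.remove(best_idx)  # a reference span can be used only once
            total += best
    # Assumed normalization: penalize both too many and too few spans.
    return total / max(len(ref), len(hyp))


G, B = "GOOD", "BAD"
reference = [G, G, B, G, G, B, G, G]

perfect = span_overlap_score(reference, reference)
partial = span_overlap_score(reference, [G, G, B, B, G, B, G, G])
all_good_score = span_overlap_score(reference, [G] * 8)
print(perfect, partial, all_good_score)
```

On this toy reference, the all-'GOOD' hypothesis has a token accuracy of 6/8 but scores far lower here, because its single long span can be matched against only one reference span.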

chrishokamp commented 9 years ago

[image: 20150218_200511]

chrishokamp commented 9 years ago

Quick rule of thumb: if your per-class F1 scores sum to 1 or less, there's probably something wrong, and you may not be learning anything.
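
A small sanity check of this rule on invented data (the counts below and the helper `per_class_f1` are assumptions for illustration). When the predictions are independent of the gold labels and tag 'BAD' at the same rate as the gold data, each class's F1 comes out equal to that class's frequency, so the two scores sum to 1. A tagger that labels everything 'GOOD' sums to less than 1, and a tagger that has actually learned something sums to well above 1.

```python
def per_class_f1(gold, pred, labels=("GOOD", "BAD")):
    """Return {label: F1} from token-level gold and predicted tags."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Invented data: 20 BAD tokens followed by 80 GOOD tokens.
gold = ["BAD"] * 20 + ["GOOD"] * 80

# Chance-level tagger: tags 20% of tokens BAD regardless of the gold label.
chance = ["BAD"] * 4 + ["GOOD"] * 16 + ["BAD"] * 16 + ["GOOD"] * 64
# Degenerate tagger: everything is GOOD.
trivial = ["GOOD"] * 100
# Informative tagger: finds 15 of the 20 BAD tokens, with 5 false alarms.
informative = ["BAD"] * 15 + ["GOOD"] * 5 + ["BAD"] * 5 + ["GOOD"] * 75

chance_sum = sum(per_class_f1(gold, chance).values())
trivial_sum = sum(per_class_f1(gold, trivial).values())
informative_sum = sum(per_class_f1(gold, informative).values())
print(chance_sum, trivial_sum, informative_sum)
```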

qe-team/marmot, issue #25