Open chrishokamp opened 9 years ago
A metric similar to BLEU could make sense here: it would measure the overlap between spans in the hypothesis and the reference. The key idea is that a span which only partially matches is not discarded, but its score is penalized.
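A rough sketch of what such a span-overlap score might look like. Nothing here is settled: the penalty for a partial match is assumed to be the Jaccard overlap of the two spans, spans are assumed to be half-open token intervals, and the names `overlap_ratio`, `best_match_score` and `soft_span_f1` are made up for illustration.

```python
from typing import List, Tuple

# Assumption: a span is a half-open token interval [start, end).
Span = Tuple[int, int]


def overlap_ratio(a: Span, b: Span) -> float:
    """Jaccard overlap of two spans: 1.0 for an exact match, 0.0 if disjoint."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def best_match_score(span: Span, candidates: List[Span]) -> float:
    """Score of the best-overlapping candidate; partial matches get partial credit."""
    return max((overlap_ratio(span, c) for c in candidates), default=0.0)


def soft_span_f1(hypothesis: List[Span], reference: List[Span]) -> float:
    """Harmonic mean of soft precision and soft recall over spans."""
    if not hypothesis or not reference:
        return 0.0
    # Soft precision: how well each hypothesis span is covered by the reference.
    precision = sum(best_match_score(h, reference) for h in hypothesis) / len(hypothesis)
    # Soft recall: how well each reference span is covered by the hypothesis.
    recall = sum(best_match_score(r, hypothesis) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = [(0, 3), (5, 8)]
    hypothesis = [(0, 3), (6, 8)]  # second span is only a partial match
    print(soft_span_f1(hypothesis, reference))
```

With this choice a partially matching span still contributes, just less than an exact match. Unlike BLEU there is no n-gram clipping or brevity penalty here, so this is only a starting point for discussion.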
Quick rule of thumb: if your per-class F1 scores sum to 1 or less, something is probably wrong, and the model may not be learning anything.
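A small sanity check along these lines, as a sketch only. It assumes a two-class tagging setup; the `OK`/`BAD` labels and the helper names `per_class_f1` and `f1_sum_looks_suspicious` are hypothetical, and the threshold of 1 is simply the rule of thumb above.

```python
from typing import Dict, Hashable, Sequence


def per_class_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Compute F1 separately for every label seen in the gold or predicted tags."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def f1_sum_looks_suspicious(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> bool:
    """Flag the rule-of-thumb case: per-class F1 scores summing to 1 or less."""
    return sum(per_class_f1(y_true, y_pred).values()) <= 1.0


if __name__ == "__main__":
    # Hypothetical data: a tagger that always predicts the majority class.
    gold = ["OK"] * 8 + ["BAD"] * 2
    always_ok = ["OK"] * 10
    print(per_class_f1(gold, always_ok))
    print(f1_sum_looks_suspicious(gold, always_ok))
```

In this made-up example the always-majority tagger scores well on `OK` but 0 on `BAD`, so the sum stays below 1 and the check fires.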
Put thoughts on metrics into this issue:
After discussion on 18.2.15: