simon-clematide closed this issue 4 years ago
@simon-clematide: pinging @mromanello and @aflueckiger
I think we should consider 2 questions/points:
(1) capacity to provide fine-grained evaluation reports to participants.
(2) SER, i.e. the capacity to weight different types of mistakes differently (penalizing type errors more than boundary errors, and entities with both type and boundary errors even more). If the fine-grained report is produced, then SER is just another measure that combines the same numbers slightly differently.
What can we gain from SER? I think a slightly better understanding of why systems are wrong, but not dramatically more information.
Overall we could leave out SER, but a detailed evaluation report could be useful.
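For reference, the weighted-error idea behind SER can be sketched as follows. This is a minimal illustration, not the task's official scorer; the error categories and weight values below are assumptions chosen to mirror the discussion (type errors cost more than boundary errors, and combined errors cost the most).

```python
# Illustrative Slot Error Rate (SER) sketch: each error category gets its
# own weight, so different kinds of mistakes are penalized differently.
# The weights here are example values, not those of any official evaluation.
def slot_error_rate(counts, n_reference, weights=None):
    """counts: dict mapping error category -> number of occurrences.
    n_reference: number of entities (slots) in the gold standard."""
    if weights is None:
        weights = {
            'deletion': 1.0,        # missed entity
            'insertion': 1.0,       # spurious entity
            'type': 0.75,           # correct span, wrong entity type
            'boundary': 0.5,        # correct type, wrong span boundaries
            'type+boundary': 1.0,   # both wrong: penalized the most
        }
    errors = sum(weights[k] * counts.get(k, 0) for k in weights)
    return errors / n_reference

# Example: 100 gold entities with a few errors of each kind.
ser = slot_error_rate(
    {'deletion': 3, 'insertion': 2, 'type': 4, 'boundary': 6},
    n_reference=100)
# 3*1.0 + 2*1.0 + 4*0.75 + 6*0.5 = 11 weighted errors -> SER = 0.11
```

With a fine-grained error report that already breaks errors down by category, SER is just one weighted combination of those counts.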
With the eval script we can already get very nuanced error reports. Its basis is fairly agnostic, and it aggregates numbers at different levels (including type confusion, which we won't use for the official ranking).
I therefore suggest dropping SER.
Can be closed.
@maud What would be the expected benefits for this evaluation if we
paper: https://pdfs.semanticscholar.org/451b/61b390b86ae5629a21461d4c619ea34046e0.pdf