hipe-eval / HIPE-scorer

A Python module for evaluating NERC and NEL system performance as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer).
https://hipe-eval.github.io
MIT License

Evaluation measures: Slot error rates #3

Closed · simon-clematide closed this issue 4 years ago

simon-clematide commented 4 years ago

@maud What would be the expected benefits for this evaluation if we added slot error rates (SER)?

paper: https://pdfs.semanticscholar.org/451b/61b390b86ae5629a21461d4c619ea34046e0.pdf

e-maud commented 4 years ago

@simon-clematide +pinging @mromanello and @aflueckiger

I think we should consider 2 questions/points:

(1) the capacity to provide fine-grained evaluation reports to participants.

(2) SER, i.e. the capacity to weight different types of mistakes differently: penalizing type errors more than boundary errors, and penalizing entities with both a type and a boundary error even more (see the sketch below). If the fine-grained report is done, then SER is just another measure that combines the same numbers slightly differently.
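For concreteness, here is a minimal sketch of what such a weighted SER could look like, loosely following the slot-error-rate idea in the paper linked above. This is not the HIPE-scorer implementation; the function name, the error categories, and the 0.5 weights are illustrative assumptions:

```python
# Hypothetical sketch of a weighted slot error rate; NOT the HIPE-scorer
# implementation. Error categories and default weights are assumptions.

def slot_error_rate(
    deletions: int,           # gold entities missed entirely
    insertions: int,          # spurious entities returned by the system
    type_errors: int,         # correct boundaries, wrong type
    boundary_errors: int,     # correct type, wrong boundaries
    both_errors: int,         # wrong type AND wrong boundaries
    n_reference: int,         # number of entities in the gold standard
    w_type: float = 0.5,      # assumed partial cost for type-only mistakes
    w_boundary: float = 0.5,  # assumed partial cost for boundary-only mistakes
) -> float:
    """Full cost for deletions, insertions and combined type+boundary
    errors; partial (weighted) cost for type-only or boundary-only errors."""
    errors = (
        deletions
        + insertions
        + both_errors
        + w_type * type_errors
        + w_boundary * boundary_errors
    )
    return errors / n_reference


# Toy example: 100 gold entities.
print(slot_error_rate(deletions=5, insertions=3, type_errors=4,
                      boundary_errors=6, both_errors=2, n_reference=100))
# -> 0.15
```

With all weights set to 1.0 this reduces to a plain error count over the reference, so the weights are just the knob for penalizing type vs. boundary mistakes differently.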

What can we gain from SER? A somewhat better understanding of why systems are wrong, I think, but not dramatically more information.

Overall we could leave out SER, but a detailed evaluation report could be useful.

aflueckiger commented 4 years ago

With the eval script we can get very nuanced error reports. It has a very agnostic basis and aggregates numbers at different levels (including type confusion, which we won't use for the official ranking).

Thus I suggest dropping SER.
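(For illustration only, a toy sketch of aggregating the same counts at several levels, including type confusion. The data and all names below are assumptions, not HIPE-scorer code:)

```python
# Toy illustration of level-wise aggregation; NOT the actual HIPE-scorer
# code. Tally (gold, system) label pairs once, then derive a per-type
# report and a type-confusion table from the same counts.
from collections import Counter

# Assumed data: (gold_label, system_label) pairs for aligned mentions.
pairs = [("PER", "PER"), ("PER", "LOC"), ("LOC", "LOC"), ("ORG", "PER")]

confusion = Counter(pairs)                        # type-confusion level
correct = Counter(g for g, s in pairs if g == s)  # per-type level
total = Counter(g for g, s in pairs)

for label in sorted(total):
    print(f"{label}: {correct[label]}/{total[label]} correct")
print("confusions:", {k: v for k, v in confusion.items() if k[0] != k[1]})
```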

mromanello commented 4 years ago

can be closed