Difference between types of evaluation?

ngiordani commented 8 years ago

I'm trying to understand the different numbers I get from running

Evaluators.BioNLP11GeniaTools and Evaluators.EvaluateInteractionXML

I know the first one calls the official eval script, but the numbers I get from the second one are pretty different. Is that expected? After spending a while trying to trace what the code is doing, I still can't work out how the metrics in Evaluators.EvaluateInteractionXML differ from those in the official eval script.

I know there are a lot of metrics, so this question may be hard to answer, but the most important thing would be to know whether the difference is expected or whether something is wrong.

jbjorne commented 8 years ago

The evaluators available via BioNLP11GeniaTools (this module should be renamed) are indeed the official BioNLP Shared Task evaluators, and whenever such an evaluator is available and can be used, it should be used. EvaluateInteractionXML is a generic event evaluator which considers an event to be a trigger node and its set of outgoing edges, roughly imitating the "approximate span and recursive" evaluation criterion of the BioNLP Shared Task GENIA tasks.

For example, for the BioNLP 2011 GENIA task the "approximate span and recursive" mode F-score from the official evaluation is 54.26% whereas EvaluateInteractionXML reports an F-score of 65.26%. On a very general level, increases in performance measured with EvaluateInteractionXML tend to translate to increases in performance in event extraction tasks, but this of course depends on how close the evaluation metrics are. Thus, EvaluateInteractionXML is mostly useful for optimizing a system in situations where an official evaluator program is not available.
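To make the "trigger node plus its set of outgoing edges" idea concrete, here is a minimal sketch of how such an event-level F-score could be computed. It is not the TEES implementation: the tuple layout, the `make_event` and `f_score` helpers, and the toy data are all hypothetical, and it uses exact span matching rather than the approximate-span criterion.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical representation (an assumption, not TEES's data model):
# an event is its trigger type, trigger span, and the set of outgoing edges.
Edge = Tuple[str, str]  # (edge type, target id)
Event = Tuple[str, Tuple[int, int], FrozenSet[Edge]]


def make_event(trigger_type: str, span: Tuple[int, int], edges: Iterable[Edge]) -> Event:
    """Build a hashable event from a trigger and its outgoing edges."""
    return (trigger_type, span, frozenset(edges))


def f_score(gold: Set[Event], predicted: Set[Event]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); an event counts only if trigger and all edges match."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the second predicted event is missing its Cause edge,
# so the whole event is counted as wrong.
gold = {
    make_event("Phosphorylation", (10, 25), [("Theme", "T1")]),
    make_event("Regulation", (30, 40), [("Theme", "E1"), ("Cause", "T2")]),
}
predicted = {
    make_event("Phosphorylation", (10, 25), [("Theme", "T1")]),
    make_event("Regulation", (30, 40), [("Theme", "E1")]),
}

print(f_score(gold, predicted))
```

Because an event with one wrong or missing edge fails as a whole, small differences in how two evaluators match triggers, edges, and nested events can shift the resulting F-score noticeably.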

ngiordani commented 8 years ago

Thank you, that definitely clarifies things!


jbjorne / TEES

Difference between types of evaluation? #19