lemonhu / RE-CNN-pytorch

PyTorch implementation of relation extraction via convolutional neural network with multi-size convolution kernels.
MIT License

Evaluation should use 'macro' averaging rather than 'micro' #1

Closed TheShadow29 closed 5 years ago

TheShadow29 commented 5 years ago

I believe macro averaging is used on SemEval. Thus in L45 (https://github.com/lemonhu/RE-CNN-pytorch/blob/master/evaluate.py#L45), it should be macro, not micro.

Changing it to macro gives precision: 70.38; recall: 76.86; f1: 73.10
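For reference, the difference between the two settings can be sketched with sklearn directly. The labels below are made up for illustration, not taken from the repo or the SemEval data:

```python
# Minimal sketch of the averaging change suggested above: sklearn's
# `average` argument switches between micro (pools all decisions, so
# frequent classes dominate) and macro (unweighted mean of per-class scores).
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2, 2, 2]  # hypothetical gold labels
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]  # hypothetical predictions

p_micro, r_micro, f_micro, _ = precision_recall_fscore_support(
    y_true, y_pred, average='micro', zero_division=0)
p_macro, r_macro, f_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0)

print(f_micro)  # 0.75: global accuracy over all 8 decisions
print(f_macro)  # lower, because the weaker minority classes count equally
```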

lemonhu commented 5 years ago

@TheShadow29 Thanks for the feedback, you are right: the official evaluation method requires macro rather than micro averaging.

The labels and per-class counts on the test set are as follows:

- Cause-Effect(e1,e2): 134
- Cause-Effect(e2,e1): 194
- Instrument-Agency(e1,e2): 22
- Instrument-Agency(e2,e1): 134
- Product-Producer(e1,e2): 108
- Product-Producer(e2,e1): 123
- Content-Container(e1,e2): 153
- Content-Container(e2,e1): 39
- Entity-Origin(e1,e2): 211
- Entity-Origin(e2,e1): 47
- Entity-Destination(e1,e2): 291
- Entity-Destination(e2,e1): 1
- Component-Whole(e1,e2): 162
- Component-Whole(e2,e1): 150
- Message-Topic(e1,e2): 210
- Message-Topic(e2,e1): 51
- Member-Collection(e1,e2): 32
- Member-Collection(e2,e1): 201
- Other: 454

I think the poor performance under macro averaging comes from the imbalanced class distribution. For example, Entity-Destination(e2,e1) has only 1 test instance, so misclassifying that single example has a severe impact on the macro-averaged result.

Note that relations here are direction-sensitive: each relation type is split by argument order.
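A toy example (hypothetical counts, not the SemEval data) makes the imbalance point concrete: one mistake on a single-example class moves macro F1 far more than micro F1.

```python
from sklearn.metrics import f1_score

# Hypothetical toy data: 100 examples of a common class and 1 rare example.
y_true      = ['common'] * 100 + ['rare']
y_pred_hit  = ['common'] * 100 + ['rare']    # the single rare example is correct
y_pred_miss = ['common'] * 100 + ['common']  # the single rare example is wrong

f_hit  = f1_score(y_true, y_pred_hit,  average='macro', zero_division=0)
f_miss = f1_score(y_true, y_pred_miss, average='macro', zero_division=0)
f_miss_micro = f1_score(y_true, y_pred_miss, average='micro', zero_division=0)

print(f_hit)         # 1.0
print(f_miss)        # ~0.50: the rare class contributes an F1 of 0
print(f_miss_micro)  # ~0.99: micro barely notices one error out of 101
```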

TheShadow29 commented 5 years ago

You are correct about the distribution. I believe the paper reports 82 F1 using macro averaging. I am guessing they use the scripts given in the scorer here: http://semeval2.fbk.eu/semeval2.php?location=data (number 8, scorer column), though I don't think it should differ from the sklearn output.

What do you think is causing the difference in the results?

lemonhu commented 5 years ago

Yes, I also believe the paper reports 82 F1 using macro averaging.

I think the essential reason the two evaluation methods (macro and micro) disagree is the imbalanced class distribution.

If we did not evaluate the subdivided relations, for example scoring Entity-Destination as a single class instead of splitting it into Entity-Destination(e1,e2) and Entity-Destination(e2,e1), the result would be different, and probably better.
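That comparison can be sketched quickly. The tiny label set below and the `strip_direction` helper are purely illustrative, not part of the repo:

```python
from sklearn.metrics import f1_score

def strip_direction(label: str) -> str:
    """Map a directed label like 'Entity-Destination(e2,e1)'
    to its undirected form 'Entity-Destination'."""
    return label.split('(')[0]

# Hypothetical predictions that get the relation right but the direction wrong.
y_true = ['Entity-Destination(e1,e2)', 'Entity-Destination(e2,e1)',
          'Cause-Effect(e1,e2)', 'Other']
y_pred = ['Entity-Destination(e1,e2)', 'Entity-Destination(e1,e2)',
          'Cause-Effect(e1,e2)', 'Other']

f_directed   = f1_score(y_true, y_pred, average='macro', zero_division=0)
f_undirected = f1_score([strip_direction(l) for l in y_true],
                        [strip_direction(l) for l in y_pred],
                        average='macro', zero_division=0)

print(f_directed)    # penalized: the (e2,e1) subdivision scores 0
print(f_undirected)  # 1.0: every undirected relation is correct
```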