@TheShadow29 Thanks for your feedback, you are right: the official evaluation method requires the use of `macro` instead of `micro`.
The label and count of each relation on the test set are as follows:

| Relation | Count |
| --- | --- |
| Cause-Effect(e1,e2) | 134 |
| Cause-Effect(e2,e1) | 194 |
| Instrument-Agency(e1,e2) | 22 |
| Instrument-Agency(e2,e1) | 134 |
| Product-Producer(e1,e2) | 108 |
| Product-Producer(e2,e1) | 123 |
| Content-Container(e1,e2) | 153 |
| Content-Container(e2,e1) | 39 |
| Entity-Origin(e1,e2) | 211 |
| Entity-Origin(e2,e1) | 47 |
| Entity-Destination(e1,e2) | 291 |
| Entity-Destination(e2,e1) | 1 |
| Component-Whole(e1,e2) | 162 |
| Component-Whole(e2,e1) | 150 |
| Message-Topic(e1,e2) | 210 |
| Message-Topic(e2,e1) | 51 |
| Member-Collection(e1,e2) | 32 |
| Member-Collection(e2,e1) | 201 |
| Other | 454 |
I think the reason `macro` averaging performs poorly here is the unbalanced distribution of the data. For example, the class `Entity-Destination(e2,e1)` has only 1 test instance, so predicting that single instance incorrectly drops its per-class F1 to 0 and seriously hurts the overall macro average. Note that the relations here are directional.
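To make this concrete, here is a minimal sketch (with made-up labels, not the SemEval data) of how a single misclassified instance of a rare class barely moves micro F1 but sharply lowers macro F1:

```python
# Toy example: two frequent classes and one class with a single
# instance that gets misclassified.
from sklearn.metrics import f1_score

y_true = ["A"] * 10 + ["B"] * 10 + ["Rare"]
y_pred = ["A"] * 10 + ["B"] * 10 + ["A"]  # only the one "Rare" instance is wrong

# Micro F1 is just accuracy here: 20/21 correct, roughly 0.95.
print(f1_score(y_true, y_pred, average="micro"))
# Macro F1 averages per-class F1; "Rare" contributes an F1 of 0,
# pulling the average down to roughly 0.65.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```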
You are correct about the distribution. I believe the paper reports an F1 of 82 using macro averaging. I am guessing they use the official scorer scripts from http://semeval2.fbk.eu/semeval2.php?location=data (number 8, scorer column), though I don't think its output should differ from sklearn's.

What do you think is causing the difference in the results?
Yes, I also believe the paper reports an F1 of 82 using macro averaging.

I think the essential reason the two evaluation methods (`macro` and `micro`) give different results is the skewed distribution of the data.
If we did not evaluate the subdivided relations, for example scoring `Entity-Destination` as a single class instead of splitting it into `Entity-Destination(e1,e2)` and `Entity-Destination(e2,e1)`, the result would be different, and probably better.
I believe macro averaging is used on SemEval. Thus in L45 of https://github.com/lemonhu/RE-CNN-pytorch/blob/master/evaluate.py#L45, it should be `macro` and not `micro`. Changing it to `macro` gives precision: 70.38; recall: 76.86; F1: 73.10.