jeniyat / WNUT_2020_RE

This repository will contain the data and code for the WNUT 2020 Relation Extraction task

Problem in reader leads to incorrect result of evaluation script #1

Open dnanhkhoa opened 4 years ago

dnanhkhoa commented 4 years ago

Hello @jeniyat,

I have just found a critical problem in your evaluation script (which will be used for evaluation in the shared task).

For example, protocol_138.ann includes 7 events (E13, E19, E47, E68, E5, E50, E4) with a single Commands relation role among their arguments. So theoretically the evaluation result should show 7 in the support column for the Commands role, but I got 6 (see the attached image).

I went deeper and found that the missing one comes from E19:

Line 25: E19    Action:T19 Acts-on:T142 Commands:E9
Line 193: E9    Action:T140

Since E9 is defined below E19, when the reader processes E19 it does not yet know what E9 is and cannot extract its entity tag (https://github.com/jeniyat/WNUT_2020_RE/blob/master/code/corpus/ProtoFile.py#L412), which leads to the missing relation Commands Arg1:T19 Arg2:T140. This problem could cause a correct relation predicted by someone's model to be counted as a false positive -> incorrect precision. The order of the annotation lines must not affect the result.

I tried a simple fix by moving E9 above E19 and it worked as I expected (see the image; the support count is now 7).

But that doesn't fix the root problem; it needs to be fixed in the reader code (you should probably build an Event -> Entity mapping before converting events to relations).
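To make the suggestion concrete, here is a minimal sketch of such a two-pass reader. This is not the repository's ProtoFile.py code; the function names and the exact .ann parsing are my own simplification. The first pass maps every event ID to its trigger entity, and only the second pass converts event arguments to relations, so a reference to an event defined later in the file still resolves correctly.

def build_event_to_trigger(ann_lines):
    # First pass: map every event ID (e.g. "E9") to its trigger entity (e.g. "T140").
    event_to_trigger = {}
    for line in ann_lines:
        if not line.startswith("E"):
            continue
        event_id, args = line.split(None, 1)
        # The first argument of an event line is its trigger, e.g. "Action:T140".
        trigger = args.split()[0].split(":")[1]
        event_to_trigger[event_id] = trigger
    return event_to_trigger

def events_to_relations(ann_lines):
    # Second pass: convert event arguments to relations, resolving event references.
    event_to_trigger = build_event_to_trigger(ann_lines)
    relations = []
    for line in ann_lines:
        if not line.startswith("E"):
            continue
        event_id, args = line.split(None, 1)
        arg1 = event_to_trigger[event_id]
        for arg in args.split()[1:]:
            role, target = arg.split(":")
            # The target may be an event (E...) or an entity (T...); resolve events to their triggers.
            arg2 = event_to_trigger.get(target, target)
            # Strip the numeric suffix brat adds to repeated roles (e.g. "Commands2" -> "Commands").
            relations.append((role.rstrip("0123456789"), arg1, arg2))
    return relations

With this, E19's Commands:E9 argument resolves to T140 regardless of whether E9 appears before or after E19 in the file.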

jeniyat commented 4 years ago

Hi @dnanhkhoa, thanks for noticing the issue. We will update the evaluation to handle this and notify you.

dnanhkhoa commented 4 years ago

Thank you for the reply; I hope to get the update soon. Thanks!

jeniyat commented 4 years ago

While waiting, you can simply use classification_report and precision_recall_fscore_support from sklearn.metrics as below:

from sklearn.metrics import classification_report, precision_recall_fscore_support

print(classification_report(y_test, pred_test, target_names=cfg.RELATIONS, labels=range(len(cfg.RELATIONS))))
print("Macro", precision_recall_fscore_support(y_test, pred_test, average='macro', labels=range(len(cfg.RELATIONS))))
print("Micro", precision_recall_fscore_support(y_test, pred_test, average='micro', labels=range(len(cfg.RELATIONS))))

Here, y_test are the gold labels, pred_test are the predicted labels, and cfg.RELATIONS is the list of relation names.
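For reference, a self-contained toy run of the snippet above; the relation names and the gold/predicted label arrays here are made up purely for illustration and are not from the shared task data.

from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical relation inventory and dummy label indices, just to show the call pattern.
RELATIONS = ["Commands", "Acts-on", "Site"]
y_test = [0, 1, 2, 0, 1]     # gold label indices
pred_test = [0, 1, 1, 0, 2]  # predicted label indices

labels = list(range(len(RELATIONS)))
print(classification_report(y_test, pred_test, target_names=RELATIONS, labels=labels))
print("Macro", precision_recall_fscore_support(y_test, pred_test, average="macro", labels=labels))
print("Micro", precision_recall_fscore_support(y_test, pred_test, average="micro", labels=labels))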