lx865712528 / EMNLP2018-JMEE

This is the code for our EMNLP 2018 paper "Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation"

Evaluate function not right #6

Open airkid opened 5 years ago

airkid commented 5 years ago

https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72
In this line, if I add `assert len(arguments) == len(arguments_)` just before it, the assertion fails. I believe this is because `arguments` holds the gold arguments while `arguments_` holds only the predicted arguments, whose length changes dynamically during training.
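For context, here is a minimal repro of the misalignment (the variable names mirror those in `testing.py`; the `(start, end, role)` tuple layout is an assumption for illustration):

```python
# zip() silently truncates to the shorter list, so position-wise comparison
# drops gold arguments instead of scoring them as misses.
arguments = [(3, 5, 11), (7, 9, 9)]   # gold arguments, e.g. (start, end, role)
arguments_ = [(3, 5, 11)]             # predictions; length varies during training

for item, item_ in zip(arguments, arguments_):
    print(item, item_)                # (7, 9, 9) is never compared at all

print(len(arguments) == len(arguments_))  # False -> the assert above fires
```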

DorianKodelja commented 5 years ago

This computes the score incorrectly: if the model predicts a wrong entity before all the correct ones, the predictions are no longer aligned with the gold list and the score is 0, as shown in this example. Gold roles are [(3,5,11), (7,9,9)]; predicted roles are [(0,2,2), (3,5,11), (7,9,9)]. First iteration: compare (3,5,11) and (0,2,2) -> fail. Second iteration: compare (7,9,9) and (3,5,11) -> fail, even though (3,5,11) was in the gold annotations. Here is a functioning version that also generates a per-class report (it requires tabulate):

calculate_sets_1.txt
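The attachment isn't reproduced here, but the set-based idea it describes can be sketched as follows (the function name, the `(start, end, role)` tuple layout, and the report columns are my assumptions, not the attachment's exact contents):

```python
from collections import defaultdict
from tabulate import tabulate

def calculate_sets(gold_args, pred_args):
    """Score arguments by set membership: a prediction counts as correct if it
    matches ANY gold argument, regardless of its position in the list."""
    per_role = defaultdict(lambda: {"gold": 0, "pred": 0, "correct": 0})
    for *_, role in gold_args:
        per_role[role]["gold"] += 1
    for *_, role in pred_args:
        per_role[role]["pred"] += 1
    for *_, role in set(gold_args) & set(pred_args):
        per_role[role]["correct"] += 1

    rows = []
    for role, c in sorted(per_role.items()):
        p = c["correct"] / c["pred"] if c["pred"] else 0.0
        r = c["correct"] / c["gold"] if c["gold"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        rows.append([role, c["gold"], c["pred"], c["correct"], p, r, f1])
    print(tabulate(rows, headers=["role", "gold", "pred", "correct", "P", "R", "F1"],
                   floatfmt=".3f"))

calculate_sets([(3, 5, 11), (7, 9, 9)],
               [(0, 2, 2), (3, 5, 11), (7, 9, 9)])  # both gold roles credited
```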

mikelkl commented 5 years ago

Hi @airkid @DorianKodelja, I came to the same conclusion as you. According to the DMCNN paper:

> An argument is correctly classified if its event subtype, offsets and argument role match those of any of the reference argument mentions.

for item, item_ in zip(arguments, arguments_): 

The above line from this repo does not match that criterion, so I replaced it with:

ct += len(set(arguments) & set(arguments_))  # count any prediction that matches a gold argument
# for item, item_ in zip(arguments, arguments_):
#     if item[2] == item_[2]:
#         ct += 1
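A quick check on the example from the earlier comment, assuming the same `(start, end, role)` tuples, shows why the set intersection implements the "match any reference mention" criterion:

```python
arguments = [(3, 5, 11), (7, 9, 9)]               # gold
arguments_ = [(0, 2, 2), (3, 5, 11), (7, 9, 9)]   # predictions

ct = len(set(arguments) & set(arguments_))
print(ct)  # 2: both gold arguments are credited; the zip version scores 0
```
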
airkid commented 5 years ago

Hi @mikelkl, I believe this is a correct implementation of the F1 score for this task.
Have you reproduced the experiment? I can only reach an F1 score below 0.4 on the test data.

mikelkl commented 5 years ago

Hi @airkid, I got a slightly higher result, but it's on my own randomly split test set, so I have no idea whether it faithfully reflects the paper's result.

airkid commented 5 years ago

Hi @mikelkl, can you try the data split updated by the author?
My result is still far from the paper's.

mikelkl commented 5 years ago

Hi @airkid, I'm afraid I can't, since I don't have the ACE 2005 English data.

carrie0307 commented 5 years ago

Hi @airkid, would you please tell me the result you got? I only reach F1 = 0.64 on trigger classification.

rhythmswing commented 4 years ago

> https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72 In this line, if I add `assert len(arguments) == len(arguments_)` just before it, the assertion fails. I believe this is because `arguments` holds the gold arguments while `arguments_` holds only the predicted arguments, whose length changes dynamically during training.

Hi,

If you've tried their code, could you share your reproduced results on trigger detection and argument detection?