vdobrovolskii / wl-coref

This repository contains the code for the EMNLP-2021 paper "Word-Level Coreference Resolution".
MIT License

Errors when evaluating #4

Closed: yhcc closed this issue 2 years ago

yhcc commented 2 years ago

After training, a lot of files are generated in the data/conll_logs folder, with names like roberta_dev_e1.gold.conll and roberta_dev_e1.pred.conll. I then called python calculate_conll.py roberta dev 20 to evaluate them, and the following errors occurred:

Traceback (most recent call last):
  File "calculate_conll.py", line 38, in <module>
    extract_f1(subprocess.run(part_a + [metric] + part_b, **kwargs)))
  File "/home/xxx/envs/wl-coref/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['perl', 'reference-coreference-scorers/scorer.pl', 'muc', 'data/conll_logs/roberta_dev_e20.gold.conll', 'data/conll_logs/roberta_dev_e20.pred.conll']' returned non-zero exit status 1.
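
As far as I can tell, the traceback itself just means that scorer.pl exited with a non-zero status, which subprocess.run turns into a CalledProcessError when the call is checked. A minimal sketch of the failing call (paths copied from the traceback; the exact arguments in calculate_conll.py may differ):

import subprocess

# Hypothetical reconstruction of the call that fails; check=True makes
# subprocess.run raise CalledProcessError on any non-zero exit status.
gold = "data/conll_logs/roberta_dev_e20.gold.conll"
pred = "data/conll_logs/roberta_dev_e20.pred.conll"
subprocess.run(
    ["perl", "reference-coreference-scorers/scorer.pl", "muc", gold, pred],
    check=True,
)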

Then I tried to run the evaluation directly through perl reference-coreference-scorers/scorer.pl muc data/conll_logs/roberta_dev_e20.gold.conll data/conll_logs/roberta_dev_e20.pred.conll, and the output was something like the following:

Invented: 18
Recall: (44 / 53) 83.01%        Precision: (44 / 59) 74.57%     F1: 78.57%
--------------------------------------------------------------------------
====> (nw/wsj/01/wsj_0174); part 000:
File (nw/wsj/01/wsj_0174); part 000:
Entity 0: (0,0) (13,15) (46,46) (57,57) (72,72)
Entity 1: (2,6) (27,35)
Entity 2: (5,6) (20,24) (65,66) (81,82) (105,106) (109,109) (128,130)
Entity 3: (9,9) (50,55) (69,70)
Entity 4: (26,26) (89,91)
Entity 5: (38,38) (313,314) (345,347) (682,682) (722,722) (747,749)
Entity 6: (85,86) (672,674)
Entity 7: (132,135) (140,140) (168,171)
Entity 8: (140,141) (183,184)
Entity 9: (187,188) (684,686)
Entity 10: (205,206) (208,212)
Entity 11: (224,225) (812,813)
Entity 12: (254,265) (272,273)
Entity 13: (260,265) (296,297)
Entity 14: (299,302) (330,332)
Entity 15: (307,328) (330,333)
Entity 16: (313,318) (345,352)
Entity 17: (362,374) (376,377)
Entity 18: (401,402) (428,428) (441,441)
Entity 19: (423,425) (451,451)
Entity 20: (499,500) (701,702) (815,816)
Entity 21: (509,509) (549,551)
Entity 22: (517,518) (544,546)
Entity 23: (535,536) (544,545) (558,559)
Entity 24: (563,564) (568,570)
Entity 25: (563,566) (582,582) (590,590) (594,594)
Entity 26: (600,601) (607,607)
Entity 27: (611,612) (622,624)
Entity 28: (611,613) (638,638) (646,649)
Entity 29: (676,686) (711,720)
Entity 30: (691,694) (696,697)
Entity 31: (724,745) (747,750)
Entity 32: (763,764) (803,804)
Entity 33: (767,776) (780,780)
====> (nw/wsj/01/wsj_0174); part 000:
File (nw/wsj/01/wsj_0174); part 000:
Entity 0: (0,0) (13,15) (46,46) (57,57) (72,72)
Entity 1: (2,6) (27,35)
Entity 2: (5,6) (20,24) (65,66) (81,82) (105,106) (109,109) (128,130)
Entity 3: (9,9) (50,55) (69,70) (446,451)
Entity 4: (26,26) (89,91)
Entity 5: (38,38) (313,314) (345,347) (682,682) (722,722) (747,749)
Entity 6: (85,86) (672,674)
Entity 7: (132,135) (140,140) (168,180)
Entity 8: (140,141) (183,184) (200,203)
Entity 9: (205,206) (208,212)
Entity 10: (254,265) (272,273)
Entity 11: (260,265) (296,297)
Entity 12: (299,302) (330,332)
Entity 13: (307,328) (330,333)
Entity 14: (313,318) (345,352)
Entity 15: (362,374) (376,377)
Entity 16: (401,402) (428,428) (441,441)
Entity 17: (423,425) (451,451)
Entity 18: (455,455) (489,489)
Entity 19: (456,456) (471,472)
Entity 20: (499,500) (701,702) (815,816)
Entity 21: (509,509) (549,551)
Entity 22: (520,520) (544,545)
Entity 23: (535,536) (558,559)
Entity 24: (563,564) (568,570)
Entity 25: (563,566) (582,582) (590,590) (594,594)
Entity 26: (600,601) (607,607)
Entity 27: (611,612) (622,624) (662,663)
Entity 28: (611,613) (638,638) (646,649)
Entity 29: (615,628) (631,644) (652,668)
Entity 30: (676,686) (711,720) (711,720)
Entity 31: (691,694) (696,697)
Entity 32: (724,745) (747,750)
Entity 33: (763,764) (803,804)
Entity 34: (767,776) (780,780)
(nw/wsj/01/wsj_0174); part 000:
Repeated mention in the response: 711, 720 7979
Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output.

The last line says there are too many repeated mentions. Do you have any idea why this is happening?

vdobrovolskii commented 2 years ago

Hi!

This happens because the span prediction module predicts the same span for different heads. The scorer doesn't like that :)
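
To illustrate the failure mode (schematic only, not the actual wl-coref code): coreference links are predicted between individual head words, and the span prediction module then expands each head into (start, end) boundaries, so two distinct heads can end up with identical boundaries. That is exactly the repeated (711,720) mention in Entity 30 of your response file above.

# Schematic sketch of the failure mode; names are hypothetical.
heads = [712, 715]  # two distinct head words predicted to corefer

def expand_to_span(head):
    # Stand-in for the span prediction module: imagine it picks the
    # same boundaries for both heads.
    return (711, 720)

mentions = [expand_to_span(h) for h in heads]
print(mentions)  # [(711, 720), (711, 720)] -> repeated mention in the output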

A quick and hacky workaround is to modify reference-coreference-scorers/lib/CorScorer.pm and comment out the condition on line 382.

I'll upload a nicer solution soon.

vdobrovolskii commented 2 years ago

@yhcc

Please try the following: in coref/conll.py, line 31, change the following loop

for cluster_id, cluster in enumerate(clusters):
    for start, end in cluster:
        if end - start == 1:
            single_word[start].append(cluster_id)
        else:
            starts[start].append(cluster_id)
            ends[end - 1].append(cluster_id)

to be:

predicted_spans = set()

for cluster_id, cluster in enumerate(clusters):
    for start, end in cluster:
        # Skip a span that has already been written, so the same mention
        # never appears twice and the scorer does not refuse to score.
        if (start, end) in predicted_spans:
            continue
        predicted_spans.add((start, end))
        if end - start == 1:
            single_word[start].append(cluster_id)
        else:
            starts[start].append(cluster_id)
            ends[end - 1].append(cluster_id)

This is a suboptimal solution, as it simply keeps whichever cluster claims a span first and drops the later duplicates, but it will fix your problem. You will need to reevaluate the dev dataset after the fix so that the .conll files are regenerated without the repeated mentions.

Please let me know how that worked out for you. Thanks for the report!

yhcc commented 2 years ago

Thanks for your reply. I will try this revision.

yhcc commented 2 years ago

It worked. Thank you very much.