vdobrovolskii / wl-coref

This repository contains the code for the EMNLP 2021 paper "Word-Level Coreference Resolution"
MIT License

Dev and Test results don't match with Paper #42

Closed · shantanu778 closed this issue 1 year ago

shantanu778 commented 1 year ago

I wanted to replicate your results on OntoNotes 5.0 with your provided pretrained model, roberta_(e20_2021.05.0201.16)release.pt, but my results on the dev and test sets do not match those reported in your paper.

Here are the results:

dev: | WL:  loss: 0.12536, f1: 0.76390, p: 0.71747, r: 0.81674 | SL:  sa: 0.96890, f1: 0.73320, p: 0.69405, r: 0.77703

test: | WL:  loss: 0.11088, f1: 0.77439, p: 0.72105, r: 0.83624 | SL:  sa: 0.96855, f1: 0.74050, p: 0.69525, r: 0.79205

I used the default config file, config.toml. Can you please tell me why it is not producing the same results?

vdobrovolskii commented 1 year ago

Hi!

The numbers you see show the LEA metric rather than the CoNLL-2012 metric reported in the paper.

You need to run calculate_conll.py on the produced files to get the CoNLL metrics. Please refer to the Evaluation section of the README.
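
For context, the CoNLL-2012 score reported in the paper is the unweighted average of the MUC, B³ and CEAF-e F1 values produced by the official reference scorer. Below is a minimal sketch of that computation, not the repository's actual calculate_conll.py; the scorer path and the key/response file names are placeholders:

import re
import subprocess

SCORER = "reference-coreference-scorers/scorer.pl"  # placeholder path to the official scorer

def extract_f1(scorer_output: str) -> float:
    # Take the last "F1: xx.xx%" figure, i.e. the coreference F1 from the TOTALS block
    return float(re.findall(r"F1:\s*([0-9.]+)%", scorer_output)[-1])

f1_scores = []
for metric in ("muc", "bcub", "ceafe"):
    proc = subprocess.run(
        ["perl", SCORER, metric, "key.conll", "response.conll"],  # placeholder file names
        check=True, text=True, capture_output=True,  # capture stdout so it can be parsed
    )
    f1_scores.append(extract_f1(proc.stdout))

print(f"CoNLL-2012 F1: {sum(f1_scores) / len(f1_scores):.2f}")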

shantanu778 commented 1 year ago

Thanks. But I encountered another issue when I tried to run calculate_conll.py:

====== TOTALS =======
Identification of Mentions: Recall: (17754 / 19764) 89.82%      Precision: (17754 / 21181) 83.82%       F1: 86.72%
--------------------------------------------------------------------------
Coreference: Recall: (13264 / 15232) 87.07%     Precision: (13264 / 16394) 80.9%        F1: 83.88%
--------------------------------------------------------------------------
None

Traceback (most recent call last):
  File "calculate_conll.py", line 42, in <module>
    extract_f1(subprocess.run(part_a + [metric] + part_b, **kwargs)))
  File "calculate_conll.py", line 17, in extract_f1
    return float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

vdobrovolskii commented 1 year ago

Hm. You can try running the evaluation scripts manually. I am using the official scorer from here: https://github.com/conll/reference-coreference-scorers

shantanu778 commented 1 year ago

@vdobrovolskii I was able to solve the issue in calculate_conll.py: you have to pass "capture_output": True to subprocess.run as well.

kwargs = {"check": True, "text": True, "capture_output": True}
extract_f1(subprocess.run(part_a + [metric] + part_b, **kwargs))

Without it, subprocess.run does not capture the scorer's output, so the search for the F1 value returns None and fails as above.

vdobrovolskii commented 1 year ago

I guess it is related to a difference in Python versions... Thanks for posting the solution!
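
For reference, capture_output and text were only added to subprocess.run in Python 3.7, so a version mismatch would explain the difference. A sketch of an equivalent kwargs dict that also works on older interpreters (assuming that is indeed the cause):

import subprocess

kwargs = {
    "check": True,
    "stdout": subprocess.PIPE,   # what capture_output=True does for stdout
    "stderr": subprocess.PIPE,   # ...and for stderr
    "universal_newlines": True,  # older spelling of text=True
}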