Closed: ritwikmishra closed this issue 1 year ago
UPDATE:
I ran the bash command for each metric:
perl reference-coreference-scorers/scorer.pl <muc/bcub/ceafe> <keys_file> <response_file> none
I got 86, 79, and 76 f1 for muc, bcub, and ceafe respectively. Average = 80.3 which is ~ 81 claimed in the paper
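As a quick sanity check of the averaging: the CoNLL score is just the unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores, so with the rounded values above:

```python
# CoNLL score = unweighted mean of MUC, B-cubed and CEAF-e F1.
# Rounded values taken from the scorer runs above.
muc, bcub, ceafe = 86, 79, 76
conll = (muc + bcub + ceafe) / 3
print(round(conll, 1))  # -> 80.3
```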
But I still cannot figure out why calculate_conll.py
throws an error...
Please see this solution: https://github.com/vdobrovolskii/wl-coref/issues/4
@vdobrovolskii I replaced the loop mechanism as suggested here,
and commented out the error-throwing condition in the perl script as suggested here.
The perl script runs fine through bash:
perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
But the python file still shows error:
$ python calculate_conll.py roberta test 20
Traceback (most recent call last):
  File "calculate_conll.py", line 40, in <module>
    extract_f1(subprocess.run(part_a + [metric] + part_b, **kwargs)))
  File "calculate_conll.py", line 15, in extract_f1
    return float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
Is there any other way to fix calculate_conll.py?
I believe the output is a bit different than expected for at least one of the perl scripts. Can you send me the outputs (just the last two lines) for the perl script with "muc", "ceafe" and "bcub" as metrics?
Here are the complete files
Predictions and Gold
I mean, can you send me the outputs of the perl script?
The calculate_conll.py
reads the stdout of the perl script and searches for the metrics there.
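To make that concrete, here is a minimal sketch of the parsing step: the script keeps the line before the last one of the scorer's stdout (the "Coreference: ... F1:" totals line) and pulls the F1 out with a regex. The sample line is copied from the scorer output later in this thread.

```python
import re

# Sample totals line as printed by scorer.pl for the MUC metric.
prev_line = ("Coreference: Recall: (13376 / 15232) 87.81% "
             "Precision: (13376 / 15760) 84.87% F1: 86.31%")

# Same regex as in calculate_conll.py: grab the number before the final '%'.
f1 = float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
print(f1)  # -> 86.31
```

If the scorer prints nothing (e.g. it refused to score), `re.search` returns `None` and `.group(1)` raises exactly the `AttributeError` shown above.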
Here is the output of perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
Output
version: 8.01 /media/data_dump/Ritwik/git/wl-coref/reference-coreference-scorers/lib/CorScorer.pm
METRIC muc:
Repeated mention in the response: 231, 237 2121
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 76, 78 1313
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (13376 / 15232) 87.81% Precision: (13376 / 15760) 84.87% F1: 86.31%
--------------------------------------------------------------------------
METRIC bcub:
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (16320.8804761468 / 19764) 82.57% Precision: (15765.8216227679 / 20351) 77.46% F1: 79.94%
--------------------------------------------------------------------------
METRIC ceafm:
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (16541 / 19764) 83.69% Precision: (16541 / 20351) 81.27% F1: 82.46%
--------------------------------------------------------------------------
METRIC ceafe:
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (3495.57012386207 / 4532) 77.13% Precision: (3495.57012386207 / 4591) 76.13% F1: 76.63%
--------------------------------------------------------------------------
METRIC blanc:
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference:
Coreference links: Recall: (98009 / 111931) 87.56% Precision: (98009 / 121567) 80.62% F1: 83.94%
--------------------------------------------------------------------------
Non-coreference links: Recall: (703839 / 883032) 79.7% Precision: (703839 / 925055) 76.08% F1: 77.85%
--------------------------------------------------------------------------
BLANC: Recall: (0.836345287908459 / 1) 83.63% Precision: (0.783537821987766 / 1) 78.35% F1: 80.9%
--------------------------------------------------------------------------
Hmm. Each line that is supposed to be fed to the script is matched correctly.
Then it might be the case that when calling each metric separately the output is different...
I could investigate it further. Could you kindly modify the extract_f1 function as follows and run the script again? Then send me the output.
def extract_f1(proc: subprocess.CompletedProcess) -> float:
    prev_line = ""
    curr_line = ""
    for line in str(proc.stdout).splitlines():
        prev_line = curr_line
        curr_line = line
        print(repr(prev_line))
    return float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
The issue was in the way you were converting bytes to a string. As stated here, simply typecasting bytes to a string using str()
gives unintended results; bytes should be decoded in order to get the proper string.
Changing the for loop from
for line in str(proc.stdout).splitlines():
to
for line in proc.stdout.decode('utf-8').splitlines():
worked!
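A small standalone illustration of why the original loop failed: calling str() on a bytes object produces its repr (a string like "b'...'" containing literal backslash-n sequences, not real newlines), so splitlines() never splits it, and the F1 regex then misses.

```python
# str() on bytes yields the repr "b'...'" with escaped \n, not real newlines;
# decode() yields a proper str that splitlines() can actually split.
raw = b"line one\nF1: 86.31%\n"

print(str(raw).splitlines())             # one mangled pseudo-line
print(raw.decode("utf-8").splitlines())  # two real lines
```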
Output:
muc 86.31
ceafe 76.63
bcub 79.94
avg 80.96
I ran the preparation scripts successfully.
Downloaded the roberta checkpoint from the dropbox link and placed it in the data folder.
Ran the command:
python calculate_conll.py roberta test 20
I hit some subprocess errors because I was using python3.6 instead of python3.7.
Error was:
unexpected keyword argument 'capture_output'
Fixed the issue with this
But then I got an error:
'NoneType' object has no attribute 'group'
origin of error --> line 15
I ran the perl script directly in bash:
perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
MUC came out to be 86 (f1), but while calculating b3 I got this error:
Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output
I think this error alone is why line 15 above was throwing that exception (because the output was empty).
How to proceed forward now? How to evaluate the results?