Closed: ritwikmishra closed this issue 1 year ago
UPDATE:
I ran the bash command for each metric:
perl reference-coreference-scorers/scorer.pl <muc/bcub/ceafe> <keys_file> <response_file> none
I got 86, 79, and 76 f1 for muc, bcub, and ceafe respectively. Average = 80.3 which is ~ 81 claimed in the paper
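As a quick sanity check of the averaging: the CoNLL score is just the unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores, so with the rounded values above:

```python
# CoNLL score = unweighted mean of MUC, B-cubed and CEAF-e F1.
# Rounded values taken from the scorer runs above.
muc, bcub, ceafe = 86, 79, 76
conll = (muc + bcub + ceafe) / 3
print(round(conll, 1))  # -> 80.3
```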
But I still cannot figure out why calculate_conll.py
throws an error...
Please see this solution: https://github.com/vdobrovolskii/wl-coref/issues/4
@vdobrovolskii I replaced the loop mechanism as suggested here,
and commented out the error-throwing condition in the perl script as suggested here.
The perl script runs fine through bash:
perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
But the python file still shows error:
$ python calculate_conll.py roberta test 20
Traceback (most recent call last):
  File "calculate_conll.py", line 40, in <module>
    extract_f1(subprocess.run(part_a + [metric] + part_b, **kwargs)))
  File "calculate_conll.py", line 15, in extract_f1
    return float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
Is there any other way to fix calculate_conll.py?
I believe the output is a bit different than expected for at least one of the perl scripts. Can you send me the outputs (just the last two lines) for the perl script with "muc", "ceafe" and "bcub" as metrics?
Here are the complete files
Predictions and Gold
I mean, can you send me the outputs of the perl script?
The calculate_conll.py
reads the stdout of the perl script and searches for the metrics there.
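To make that concrete, here is a minimal sketch of the parsing step: the script keeps the line before the last one of the scorer's stdout (the "Coreference: ... F1:" totals line) and pulls the F1 out with a regex. The sample line is copied from the scorer output later in this thread.

```python
import re

# Sample totals line as printed by scorer.pl for the MUC metric.
prev_line = ("Coreference: Recall: (13376 / 15232) 87.81% "
             "Precision: (13376 / 15760) 84.87% F1: 86.31%")

# Same regex as in calculate_conll.py: grab the number before the final '%'.
f1 = float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
print(f1)  # -> 86.31
```

If the scorer prints nothing (e.g. it refused to score), `re.search` returns `None` and `.group(1)` raises exactly the `AttributeError` shown above.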
Here is the output of perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
Output
version: 8.01 /media/data_dump/Ritwik/git/wl-coref/reference-coreference-scorers/lib/CorScorer.pm
METRIC muc:
Repeated mention in the response: 231, 237 2121
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 76, 78 1313
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (13376 / 15232) 87.81% Precision: (13376 / 15760) 84.87% F1: 86.31%
--------------------------------------------------------------------------
METRIC bcub:
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (16320.8804761468 / 19764) 82.57% Precision: (15765.8216227679 / 20351) 77.46% F1: 79.94%
--------------------------------------------------------------------------
METRIC ceafm:
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (16541 / 19764) 83.69% Precision: (16541 / 20351) 81.27% F1: 82.46%
--------------------------------------------------------------------------
METRIC ceafe:
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (3495.57012386207 / 4532) 77.13% Precision: (3495.57012386207 / 4591) 76.13% F1: 76.63%
--------------------------------------------------------------------------
METRIC blanc:
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 231, 237 2121
====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99% Precision: (17786 / 20350) 87.4% F1: 88.67%
--------------------------------------------------------------------------
Coreference:
Coreference links: Recall: (98009 / 111931) 87.56% Precision: (98009 / 121567) 80.62% F1: 83.94%
--------------------------------------------------------------------------
Non-coreference links: Recall: (703839 / 883032) 79.7% Precision: (703839 / 925055) 76.08% F1: 77.85%
--------------------------------------------------------------------------
BLANC: Recall: (0.836345287908459 / 1) 83.63% Precision: (0.783537821987766 / 1) 78.35% F1: 80.9%
--------------------------------------------------------------------------
Hmm. Each line that is supposed to be fed to the script is matched correctly.
Then it might be the case that when calling each metric separately the output is different...
I could investigate it further. Could you kindly modify the extract_f1 function as follows and run the script again? Then send me the output.
def extract_f1(proc: subprocess.CompletedProcess) -> float:
    prev_line = ""
    curr_line = ""
    for line in str(proc.stdout).splitlines():
        prev_line = curr_line
        curr_line = line
        print(repr(prev_line))
    return float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
The issue was in the way you were converting bytes to a string. As stated here, simply typecasting bytes to a string using str()
gives unintended results; bytes should be decoded in order to get the proper string.
Changing the for loop from
for line in str(proc.stdout).splitlines():
to
for line in proc.stdout.decode('utf-8').splitlines():
worked!
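A small standalone illustration of why the original loop failed: calling str() on a bytes object produces its repr (a string like "b'...'" containing literal backslash-n sequences, not real newlines), so splitlines() never splits it, and the F1 regex then misses.

```python
# str() on bytes yields the repr "b'...'" with escaped \n, not real newlines;
# decode() yields a proper str that splitlines() can actually split.
raw = b"line one\nF1: 86.31%\n"

print(str(raw).splitlines())             # one mangled pseudo-line
print(raw.decode("utf-8").splitlines())  # two real lines
```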
Output:
muc 86.31
ceafe 76.63
bcub 79.94
avg 80.96
I ran the preparation scripts successfully.
Downloaded the roberta checkpoint from the dropbox link and placed it in the data folder.
Ran the command:
python calculate_conll.py roberta test 20
I hit some subprocess errors because I was using python3.6 instead of python3.7.
Error was:
unexpected keyword argument 'capture_output'
Fixed the issue with this
But then I got an error:
'NoneType' object has no attribute 'group'
origin of error --> line 15
I ran the perl script directly in bash:
perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none
MUC came out to be 86 (f1), but while calculating b3 I got this error:
Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output
I think this error alone is why line 15 above was throwing that exception (because the output was empty).
How to proceed forward now? How to evaluate the results?