sunyilgdx / SIFRank

The code of our paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model"

Evaluation results not reproducible #7

Closed charlesBak closed 2 years ago

charlesBak commented 2 years ago

Hi Sun Yi,

thank you for sharing your valuable work. I have run the evaluation myself and the results I obtain slightly differ from the ones in your paper. Below are the evaluation results for SIFRank on the SemEval2017 dataset.

N=5
P=0.4864097363083164
R=0.14057920037519053
F1=0.21811897398581043

N=10
P=0.43448275862068964
R=0.25114315863524445
F1=0.31830002228991755

N=15
P=0.39143979412163077
R=0.3388439441904092
F1=0.3632478632478633
totally cost 501.45456099510193
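(As a quick sanity check, not part of the repo: the reported F1 values are consistent with the standard harmonic mean of precision and recall, F1 = 2PR/(P+R), so the discrepancy is in P and R themselves, not in how F1 is computed.)

```python
# Verify that each reported F1 equals 2*P*R / (P + R) for the numbers above.
def f1(p, r):
    return 2 * p * r / (p + r)

reported = {
    5:  (0.4864097363083164, 0.14057920037519053, 0.21811897398581043),
    10: (0.43448275862068964, 0.25114315863524445, 0.31830002228991755),
    15: (0.39143979412163077, 0.3388439441904092, 0.3632478632478633),
}
for n, (p, r, f) in reported.items():
    assert abs(f1(p, r) - f) < 1e-6, f"F1 mismatch at N={n}"
```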

Could you explain what the reason might be?

Best regards, Charles

sunyilgdx commented 2 years ago

The version of the Stanford CoreNLP may influence the final performance. We used stanford-corenlp-full-2018-02-27. Which version do you use?

charlesBak commented 2 years ago

I use the same version.

sunyilgdx commented 2 years ago

Which version of NLTK? This model is not random, so if the versions of all packages are the same, the results should be identical.

charlesBak commented 2 years ago

nltk 3.5. I will downgrade it to 3.4.3 and test again.

charlesBak commented 2 years ago

I still obtain the same results. I have all dependencies matching the ones you have specified. Did you use the same script for the evaluation?

sunyilgdx commented 2 years ago

I verified that the results are the same on two Windows 10 computers using sifrank_eval.

charlesBak commented 2 years ago

Thanks @sunyilgdx. I'll figure it out.

sunyilgdx commented 2 years ago

Sorry that I wasn't able to solve it for you, but I still hope to find the cause and a solution. @charlesBak

charlesBak commented 2 years ago

No problem at all. But I have another question: in the paper you mention using ELMo L0 when the document is shorter than 128 and L1 when it is longer than 128. Do you mean 128 tokens? Could you explain this in more detail?

sunyilgdx commented 2 years ago

Sorry, the wording in the paper is a bit problematic. Our experiments on the three datasets indicate that long texts need L0 and short texts need L1. However, this holds only for these three datasets; in other scenarios the parameters need to be tuned accordingly. The parameter settings used in the paper are as follows, which can also be seen in the code.

if database == "Inspec":
    data, labels = fileIO.get_inspec_data()
    lamda = 0.6
    elmo_layers_weight = [0.0, 1.0, 0.0]  # use ELMo layer L1
elif database == "Duc2001":
    data, labels = fileIO.get_duc2001_data()
    lamda = 1.0
    elmo_layers_weight = [1.0, 0.0, 0.0]  # use ELMo layer L0
else:  # SemEval2017
    data, labels = fileIO.get_semeval2017_data()
    lamda = 0.6
    elmo_layers_weight = [1.0, 0.0, 0.0]  # use ELMo layer L0
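To illustrate the effect of this parameter, here is a minimal sketch (not the repo's actual code) of how a weight vector like `elmo_layers_weight` combines ELMo's three representation layers into a single embedding; the tensor shape `(3, seq_len, dim)` is an assumption for the example, with one row per ELMo layer (L0 = char-CNN token layer, L1/L2 = biLSTM layers):

```python
import numpy as np

def combine_layers(layers, elmo_layers_weight):
    """Weighted sum over the layer axis: layers has shape (3, seq_len, dim)."""
    w = np.asarray(elmo_layers_weight, dtype=float).reshape(3, 1, 1)
    return (layers * w).sum(axis=0)

# Toy example: 7 tokens with 1024-dim vectors per layer.
layers = np.random.rand(3, 7, 1024)
emb = combine_layers(layers, [1.0, 0.0, 0.0])  # selects layer L0 only
assert np.allclose(emb, layers[0])
```

With a one-hot weight vector such as `[1.0, 0.0, 0.0]`, the "sum" simply selects a single layer, which matches the per-dataset settings above.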