Hi,
As mentioned in your paper, you used Macro-averaged scores, and the reported present-keyphrase prediction result of the catSeqD model on the KP20k dataset is 0.285 (F1@5 metric).
When I ran the catSeqD model with your code, I got a Macro-averaged F1@5 of 0.286, similar to yours, but a Micro-averaged score of 0.270.
However, according to the results reported in the paper "One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases" (https://arxiv.org/abs/1810.05241),
it seems that they used the Micro-averaged score,
and they report an F1@5 of 0.348 for the catSeqD model.
I am confused about the different results for the same model on the same dataset.
Is there anything wrong with this comparison?
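To make the gap concrete, here is a minimal sketch of how the two averaging schemes differ. This is illustrative only, not the repo's actual evaluation code; the function names and toy data are my own, and note that implementations also differ in whether precision@5 divides by a fixed 5 or by the number of predictions actually made.

```python
def f1(prec, rec):
    # Harmonic mean of precision and recall, guarding against 0/0.
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def macro_f1_at_5(preds, golds):
    # Macro: compute F1@5 per document, then average over documents.
    scores = []
    for pred, gold in zip(preds, golds):
        top5 = pred[:5]
        matched = len(set(top5) & set(gold))
        prec = matched / len(top5)  # some papers divide by a fixed 5 instead
        rec = matched / len(gold)
        scores.append(f1(prec, rec))
    return sum(scores) / len(scores)

def micro_f1_at_5(preds, golds):
    # Micro: pool matches over the whole corpus, then compute a single F1.
    total_matched = total_pred = total_gold = 0
    for pred, gold in zip(preds, golds):
        top5 = pred[:5]
        total_matched += len(set(top5) & set(gold))
        total_pred += len(top5)
        total_gold += len(gold)
    return f1(total_matched / total_pred, total_matched / total_gold)

# Toy corpus: two documents with predicted and gold keyphrases.
preds = [["a", "b", "c", "d", "e"], ["x"]]
golds = [["a", "b"], ["x", "y"]]
print(macro_f1_at_5(preds, golds))  # per-document F1s averaged: ~0.619
print(micro_f1_at_5(preds, golds))  # corpus-pooled counts: 0.6
```

On the same predictions the two schemes give different numbers, so results computed under one scheme are not directly comparable to results computed under the other.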
As mentioned in Section 6.2 of our paper, our implementation of F1@5 differs from that of Yuan et al. 2018. See the screenshot below for the reason.