Here the code checks whether or not a predicted phrase (pred_seq) can match any of the ground-truth phrases (tgt_seqs). Each phrase has been broken down into a sequence of tokens.
For example:
- pred_seq=['cute', 'rabbit'] and targets=['cute', 'rabbit'] returns match_score=1 (all tokens exactly match)
- pred_seq=['cute', 'world'] and targets=['cute', 'rabbit'] returns match_score=0 (some tokens don't match)
- pred_seq=['a', 'cute', 'rabbit'] and targets=['cute', 'rabbit'] returns match_score=0 (the number of tokens doesn't match)
With the match scores, the scores for the phrase list (precision/recall/F-score) are computed later (e.g. L124).
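For illustration, here is a minimal sketch of that matching-and-scoring logic (the helper names match_score and prf_at_m are hypothetical, not the exact functions in the repo):

```python
# Minimal sketch of the exact-match logic described above
# (hypothetical helper names, not the repo's actual functions).
def match_score(pred_seq, tgt_seqs):
    """Return 1 if the predicted token sequence exactly equals any target token sequence."""
    return 1 if any(pred_seq == tgt for tgt in tgt_seqs) else 0

def prf_at_m(pred_seqs, tgt_seqs):
    """Precision/recall/F1 over the list of predicted phrases for one document."""
    num_correct = sum(match_score(p, tgt_seqs) for p in pred_seqs)
    precision = num_correct / len(pred_seqs) if pred_seqs else 0.0
    recall = num_correct / len(tgt_seqs) if tgt_seqs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The examples above: an exact match scores 1, a length mismatch scores 0.
print(match_score(['cute', 'rabbit'], [['cute', 'rabbit']]))       # 1
print(match_score(['a', 'cute', 'rabbit'], [['cute', 'rabbit']]))  # 0
```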
Thanks for the response, @memray
So I was confused when I reimplemented the experiment without using your code. If I used a set for the metrics, the score was about 10 F1 points lower. The performance is reproduced by counting every occurrence of the same phrase when it appears several times. I had misunderstood the formula as meaning that the model's duplicated predicted phrases were removed.
No problem! Some preprocessing does matter for scores here, like stemming and deduplication.
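For example, a rough sketch of that kind of preprocessing (stemming plus deduplication), assuming NLTK's PorterStemmer; the exact pipeline in the repo may differ:

```python
# Rough sketch of stemming + deduplication of predicted phrases.
# This is an illustration, not the repo's actual preprocessing code.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def normalize(phrase_tokens):
    """Lowercase and stem each token so surface variants compare equal."""
    return tuple(stemmer.stem(t.lower()) for t in phrase_tokens)

def dedup_phrases(phrases):
    """Keep only the first occurrence of each stemmed phrase."""
    seen, unique = set(), []
    for p in phrases:
        key = normalize(p)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

# 'Cute Rabbits' and 'cute rabbit' stem to the same phrase, so only one is kept.
print(dedup_phrases([['Cute', 'Rabbits'], ['cute', 'rabbit']]))  # [['Cute', 'Rabbits']]
```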
Hi, @memray
I have one more question. Your paper says you use macro-average, but this line looks like micro-average. Can you explain that?
@qute012 You can ignore the micro_ prefix here; it's a bit confusing. I just compute P/R/F1 for each data point (there's an outer loop here) and all scores are saved to disk. The macro-average over all data points is done in other places, such as in this notebook, by simply running eval_avg_dict = {k: np.average(v) for k, v in eval_full_dict.items()}.
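To make those two steps concrete, a hypothetical illustration (the metric names and per-document numbers below are made up):

```python
import numpy as np

# Step 1 (done in the evaluation loop): one score per data point (document).
eval_full_dict = {
    'precision@M': [0.5, 0.0, 1.0],
    'recall@M':    [1.0, 0.0, 0.5],
    'f1@M':        [0.67, 0.0, 0.67],
}

# Step 2 (done in the notebook): macro-average over all data points.
eval_avg_dict = {k: np.average(v) for k, v in eval_full_dict.items()}
print(eval_avg_dict)
```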
@memray Thanks~ But I think micro-average gives a better score when a document has no keyphrases. Let me assume the example below.
Document
A: 2 keyphrases
B: 0 keyphrases
Model
A: 4 predictions, 2 correct
B: 0 predictions, 0 correct
If micro-average can ignore document B without decreasing the F1 score, F1@M is computed as below: P = (2 + 0) / (4 + 0) = 0.5, R = (2 + 0) / (2 + 0) = 1, F1@M = (2 × 0.5 × 1) / (0.5 + 1) ≈ 0.67
And it's also confusing: your code looks like it computes Precision, Recall, and F1 for each example. But as far as I know, macro-average F1 is obtained from the macro-average Precision and Recall, not by averaging the per-example F1 scores. I can't find this operation in your code. An example is below.
If macro-average: P_A = 0.5, R_A = 1, F1@M_A = (2 × 0.5 × 1) / (0.5 + 1) ≈ 0.67; P_B = 0, R_B = 0, F1@M_B = 0
Method 1. Average of the per-document F1@M = (0.67 + 0) / 2 ≈ 0.33 # I mean this is not proper
Method 2. Macro-average Precision: (P_A + P_B) / 2 = 0.25, Macro-average Recall: (R_A + R_B) / 2 = 0.5, Macro-average F1: (2 × 0.25 × 0.5) / (0.25 + 0.5) ≈ 0.33
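To make the comparison concrete, a small sketch of the two methods on the A/B example above (hypothetical code, just restating my question):

```python
# Per-document precision and recall from the A/B example above.
docs = {'A': {'p': 0.5, 'r': 1.0}, 'B': {'p': 0.0, 'r': 0.0}}

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# Method 1: average the per-document F1 scores.
method1 = sum(f1(d['p'], d['r']) for d in docs.values()) / len(docs)

# Method 2: macro-average P and R first, then compute F1 once.
macro_p = sum(d['p'] for d in docs.values()) / len(docs)
macro_r = sum(d['r'] for d in docs.values()) / len(docs)
method2 = f1(macro_p, macro_r)

# The two can differ in general, although they happen to coincide here.
print(method1, method2)  # ~0.33 and ~0.33
```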
I want to know which one is yours.
Additionally, if I calculated correctly, it looks better to exclude documents that have no keyphrases, because a document without keyphrases can never reach an F1 score of 1.
@qute012 Previously I did try removing documents if they don't have valid keyphrases. But it causes more inconsistency when people reproduce results (e.g. different tokenization -> different present/absent phrase matches -> a different number of valid present/absent documents), and actually it does not affect the scores very much.
I'm inclined to treat each doc equally (using macro-average) rather than each phrase (micro-average), but usually they show a strong correlation.
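As a rough sketch of that difference (hypothetical per-document counts, not the actual evaluation code):

```python
# Micro vs. macro aggregation over the A/B example above.
docs = [
    {'num_correct': 2, 'num_pred': 4, 'num_tgt': 2},  # document A
    {'num_correct': 0, 'num_pred': 0, 'num_tgt': 0},  # document B (no keyphrases)
]

# Micro: pool phrase counts over all documents, then compute P/R once.
tp = sum(d['num_correct'] for d in docs)
micro_p = tp / max(sum(d['num_pred'] for d in docs), 1)
micro_r = tp / max(sum(d['num_tgt'] for d in docs), 1)

# Macro: compute P/R per document (each doc weighted equally), then average.
macro_p = sum(d['num_correct'] / max(d['num_pred'], 1) for d in docs) / len(docs)
macro_r = sum(d['num_correct'] / max(d['num_tgt'], 1) for d in docs) / len(docs)

print(micro_p, micro_r)  # 0.5, 1.0
print(macro_p, macro_r)  # 0.25, 0.5
```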
Thanks for the great work!
I have one question about your paper. Your paper uses the formula below.
I understood the length of the intersection in that formula, but your released code uses lists, not sets, for the metrics.
Some examples to explain my question.
According to your code, the length corresponding to the intersection is 2. However, the actual intersection should be 1, since the formula's intersection is over sets, not lists.
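For example (a hypothetical illustration of what I mean, not your actual code):

```python
# The same correct phrase predicted twice is counted twice with list-style
# matching, but only once with a set intersection.
preds = ['cute rabbit', 'cute rabbit', 'big world']
targets = ['cute rabbit']

# List-style matching: each predicted occurrence is scored independently.
list_correct = sum(1 for p in preds if p in targets)   # 2

# Set-style matching: duplicates collapse before counting.
set_correct = len(set(preds) & set(targets))           # 1

print(list_correct, set_correct)
```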
Is it right to follow this code to reproduce the performance reported in your paper?