memray / OpenNMT-kpg-release

Keyphrase Generation
MIT License

Scoring question #38

Closed · qute012 closed this issue 3 years ago

qute012 commented 3 years ago

Thanks for the great work!

I have one question about your paper. Your paper uses the formula below (attached as an image).

I understood the length of the intersection in that formula, but your released code uses a list as the metric, not a set.

```python
for pred_id, pred_seq in enumerate(pred_seqs):
    if type == 'exact':
        match_score[pred_id] = 0
        for true_id, true_seq in enumerate(tgt_seqs):
            match = True
            if len(pred_seq) != len(true_seq):
                continue
            for pred_w, true_w in zip(pred_seq, true_seq):
                # if any pair of words differs, the match fails
                if pred_w != true_w:
                    match = False
                    break
            # if every word in pred_seq matches one true_seq exactly, the match succeeds
            if match:
                match_score[pred_id] = 1
                break
```

Here is an example to explain my question:

```python
preds = ['hello world', 'cute rabbit', 'cute rabbit', 'fast train']
targets = ['cute rabbit']
```

According to your code, the count corresponding to the intersection is 2. However, computed over sets, the actual intersection size should be 1.
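To make this concrete, here is a small snippet of my own (not your code) showing the two counts I am comparing:

```python
# same example lists as above
preds = ['hello world', 'cute rabbit', 'cute rabbit', 'fast train']
targets = ['cute rabbit']

# list-style counting (what the released code effectively does): every predicted
# phrase that matches a target counts, so the duplicate 'cute rabbit' counts twice
num_matches_list = sum(1 for p in preds if p in targets)   # -> 2

# set-style counting (how I read the formula): duplicates collapse first
num_matches_set = len(set(preds) & set(targets))           # -> 1
```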

Is it right to follow this code when reimplementing it, to reproduce the performance reported in your paper?

memray commented 3 years ago

Here the code checks whether or not a predicted phrase (pred_seq) can match any of the ground-truth phrases (tgt_seqs). Each phrase has been broken down into a sequence of tokens.

For example:

- pred_seq=['cute', 'rabbit'] and targets = ['cute', 'rabbit'] returns match_score=1 (all tokens match exactly)
- pred_seq=['cute', 'world'] and targets = ['cute', 'rabbit'] returns match_score=0 (some tokens don't match)
- pred_seq=['a', 'cute', 'rabbit'] and targets = ['cute', 'rabbit'] returns match_score=0 (the number of tokens doesn't match)

With these match scores, the scores for the whole phrase list (precision/recall/F-score) are computed later (e.g. L124).
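Roughly, that later step looks like this (a simplified sketch of the scoring; prf_at_k is an illustrative name, not the exact function in the repo):

```python
import numpy as np

def prf_at_k(match_score, num_targets, k):
    """Simplified per-document precision/recall/F1 at cutoff k."""
    topk = match_score[:k]                        # keep only the top-k predictions
    num_correct = int(np.sum(topk))               # predictions that matched a target
    precision = num_correct / len(topk) if len(topk) > 0 else 0.0
    recall = num_correct / num_targets if num_targets > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# e.g. match_score = [1, 0, 1, 0] from the loop above and 2 ground-truth phrases:
# prf_at_k([1, 0, 1, 0], num_targets=2, k=4)  ->  (0.5, 1.0, 0.666...)
```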

qute012 commented 3 years ago

Thanks for the response, @memray.

So I was confused when I reimplemented the experiment without using your code. When I used a set as the metric, the score was lower by about 10 F1 points. The performance is reproduced by counting each occurrence of the same phrase when it appears several times. I had misunderstood the formula to mean that duplicated predicted phrases were removed.

memray commented 3 years ago

No problem! Some preprocessing does matter for scores here, like stemming and deduplication.
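For example, the normalization step looks roughly like this (a sketch assuming NLTK's PorterStemmer; not the exact code in this repo):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def normalize_phrases(phrases):
    """Stem every token, then drop duplicate phrases while keeping order."""
    seen, out = set(), []
    for phrase in phrases:
        stemmed = ' '.join(stemmer.stem(tok) for tok in phrase.lower().split())
        if stemmed not in seen:
            seen.add(stemmed)
            out.append(stemmed)
    return out

# normalize_phrases(['neural networks', 'Neural Network', 'deep learning'])
# -> ['neural network', 'deep learn']
```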

qute012 commented 3 years ago

Hi, @memray

I have one more question. Your paper says macro-average is used, but this line looks like it computes a micro-average. Can you explain that?

memray commented 3 years ago

@qute012 You can ignore the micro_ here, it's a bit confusing. I just compute P/R/F1 for each data point (there's an outer loop here) and all scores are saved to disk. The macro-average over all data points is done elsewhere, such as in this notebook, simply by running eval_avg_dict = {k: np.average(v) for k, v in eval_full_dict.items()}.
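Putting it together, the flow is roughly as follows (a simplified sketch; the metric keys and toy numbers are only illustrative):

```python
import numpy as np

# toy per-document scores; in practice these are loaded from the score files
# that the evaluation script writes to disk
per_doc_scores = [
    {'precision': 0.5, 'recall': 1.0, 'f_score': 0.67},
    {'precision': 0.0, 'recall': 0.0, 'f_score': 0.0},
]

eval_full_dict = {'precision@M': [], 'recall@M': [], 'f_score@M': []}
for doc in per_doc_scores:                        # the "outer loop" over data points
    eval_full_dict['precision@M'].append(doc['precision'])
    eval_full_dict['recall@M'].append(doc['recall'])
    eval_full_dict['f_score@M'].append(doc['f_score'])

# macro-average: each document contributes equally
eval_avg_dict = {k: np.average(v) for k, v in eval_full_dict.items()}
print(eval_avg_dict)   # {'precision@M': 0.25, 'recall@M': 0.5, 'f_score@M': 0.335}
```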

qute012 commented 3 years ago

@memray Thanks~ But I think micro-average is a better way to score documents that have no keyphrases. Let me assume the setup below.

Documents:
- A: 2 keyphrases
- B: 0 keyphrases

Model predictions:
- A: 4 predictions, 2 correct
- B: 0 predictions, 0 correct

With micro-average, document B can be ignored without decreasing the F1 score. F1@M works out as: P = (2 + 0) / (4 + 0) = 0.5, R = (2 + 0) / (2 + 0) = 1, F1@M = (2 × 0.5 × 1) / (0.5 + 1) ≈ 0.67.

And something else is confusing: your code appears to compute precision, recall, and F1 for each example. But as far as I know, the macro-average F1 score should be computed from the macro-average precision and recall, not as average(F1 scores), and I can't find that operation in your code. An example is below.

With macro-average: P_A = 0.5, R_A = 1, F1@M_A = (2 × 0.5 × 1) / (0.5 + 1) ≈ 0.67; P_B = 0, R_B = 0, F1@M_B = 0.

Method 1: average of the per-document F1@M scores = (0.67 + 0) / 2 ≈ 0.33 (I mean this is not the proper way)

Method 2: macro-average precision = (P_A + P_B) / 2 = 0.25, macro-average recall = (R_A + R_B) / 2 = 0.5, macro-average F1 = (2 × 0.25 × 0.5) / (0.25 + 0.5) ≈ 0.33

I want to know which one is yours.
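To be concrete, here is a small snippet of my own (not from your repo) that computes the micro-average and both macro variants on the A/B example above:

```python
# toy counts for the two documents above
docs = [
    {'num_preds': 4, 'num_correct': 2, 'num_targets': 2},  # document A
    {'num_preds': 0, 'num_correct': 0, 'num_targets': 0},  # document B
]

def safe_div(a, b):
    return a / b if b else 0.0

def f1(p, r):
    return safe_div(2 * p * r, p + r)

# micro-average: pool the counts over all documents first
micro_p = safe_div(sum(d['num_correct'] for d in docs), sum(d['num_preds'] for d in docs))
micro_r = safe_div(sum(d['num_correct'] for d in docs), sum(d['num_targets'] for d in docs))
print('micro F1:', f1(micro_p, micro_r))          # ~0.67

# per-document precision / recall
per_doc_p = [safe_div(d['num_correct'], d['num_preds']) for d in docs]
per_doc_r = [safe_div(d['num_correct'], d['num_targets']) for d in docs]

# method 1: average the per-document F1 scores
print('method 1:', sum(f1(p, r) for p, r in zip(per_doc_p, per_doc_r)) / len(docs))  # ~0.33

# method 2: F1 of the macro-averaged precision and recall
macro_p = sum(per_doc_p) / len(docs)
macro_r = sum(per_doc_r) / len(docs)
print('method 2:', f1(macro_p, macro_r))          # ~0.33
```

(In this particular example the two macro variants happen to coincide; in general they give different numbers.)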

Additionally, if I calculated correctly, micro-average seems to handle documents without keyphrases better, because under macro-average a document without keyphrases can never reach an F1 score of 1.

memray commented 3 years ago

@qute012 Previously I did try removing documents if they don't have valid keyphrases. But it causes more inconsistency issues when people try to reproduce results (e.g. different tokenization -> different present/absent phrase matching -> different numbers of valid present/absent documents), and in practice it does not affect the scores very much.

I'm inclined to treat each doc equally (using macro-average) rather than each phrase (micro-average), but usually they show a strong correlation.