memray / OpenNMT-kpg-release

Keyphrase Generation
MIT License

R@50 computation for absent keyphrase generation #24

Closed. wasiahmad closed this issue 3 years ago

wasiahmad commented 3 years ago

Hi, I have a quick question to understand how R@50 is computed. If I understand correctly, the ACL 2020 paper uses One2Seq-style generation (with a beam size of 10). Duplicate keyphrases are removed after stemming, so I am confused about how 50 keyphrases can be generated so that R@50 can be computed. If a model does not generate 50 keyphrases, what is the point of computing R@50?

Also, when beam decoding is used, does the final prediction consist of 10 concatenated sequences of keyphrases, or is only the best sequence (e.g., n_best = 1) evaluated?

memray commented 3 years ago

Hi Wasi,

Thank you for your questions.

  1. For exhaustive decoding, we actually used a beam size of 50 and a maximum sequence length of 40, so in general there is a sufficient number of absent phrases.
  2. If only a very small number of phrases (fewer than 50) were predicted, we just take all predictions for computing recall (= #correct_kp / #groundtruth_kp). Yes, the name R@50 makes less sense in this case.
  3. There can be more sequences than the given beam size, because sequences often terminate (by predicting the end-of-sequence token) before reaching the max length (40). I use all the predictions in the case of exhaustive search (often evaluated with F@10/@O for present and R@50 for absent). Also, by the end of the search I retain the results from all active beams, even if they have not predicted the end-of-sequence token.

But in the case of self-terminating search (often evaluated with F@5/@M), we only use the phrases in the best sequence and drop the rest.
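For concreteness, here is a minimal sketch of how recall for absent keyphrases could be computed under this setup (the function names and the choice of Porter stemming are illustrative, not the repo's exact code):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(phrase):
    # Normalize a phrase by stemming each token, e.g. "neural networks" -> "neural network"
    return " ".join(stemmer.stem(tok) for tok in phrase.lower().split())

def recall_at_k(predicted, groundtruth, k=50):
    # Deduplicate predictions after stemming, keep their original rank order,
    # truncate to the top k, then compute recall = #correct_kp / #groundtruth_kp.
    gold = {stem_phrase(p) for p in groundtruth}
    seen, topk = set(), []
    for p in predicted:
        s = stem_phrase(p)
        if s not in seen:
            seen.add(s)
            topk.append(s)
        if len(topk) == k:
            break
    correct = sum(1 for s in topk if s in gold)
    return correct / max(len(gold), 1)
```

If fewer than k unique phrases are predicted, the truncation is a no-op and all predictions are used, which is the situation described in point 2.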

Let me know if you have any questions.

Rui

wasiahmad commented 3 years ago

To clarify my understanding: setting the beam size to 50 and the maximum sequence length to 40 does not mean the model will generate enough (>50) keyphrases. If I assume the average keyphrase length is 2, then with a maximum sequence length of 40 you would get on average 20 keyphrases (ignoring the special symbols that separate the keyphrases). So, if we want to generate more keyphrases, we need to set n_best > 1, where n_best indicates the number of best sequences returned by the beam search. Isn't that right?

If I understand correctly, the difference between the "self-terminating search" and the "exhaustive search" is that n_best is set to 1 and 50, respectively. Is that correct?

I am curious and want to know your opinion. Performing beam decoding with beam size 50 and length 40 takes considerably more time than greedy decoding. Is the performance gain worth the decoding cost?

memray commented 3 years ago

Exactly! n_best denotes the number of beams/sequences that will be retained after beam search. In my setting, it is 500 (included in the config files config-test-keyphrase-one2one.yml and config-test-keyphrase-one2seq.yml). Generally, it is very easy for One2One models to collect 500 unique phrases, but that is not the case for One2Seq due to the enormous number of duplicates.

The difference between the "self-terminating search" (greedy decoding) and the "exhaustive search" (which proliferates the number of phrases via beam search), as discussed in our paper, is actually beam_size, which is 1 and 50 respectively. You can try n_best=1 and beam_size=50 at the same time, which means taking only the phrases from the top-ranked sequence returned by beam search, but the performance is generally worse than greedy decoding (beam_size=1).
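As a rough illustration of that distinction (a sketch, not the repo's implementation; `beam_outputs` is assumed to be the list of retained decoded sequences in rank order, each a separator-delimited string of phrases, and `"<sep>"` is an assumed separator token):

```python
def collect_phrases(beam_outputs, exhaustive=True, sep="<sep>"):
    # Self-terminating search: keep only the phrases from the top-ranked sequence.
    # Exhaustive search: pool the phrases from all retained beams.
    sequences = beam_outputs if exhaustive else beam_outputs[:1]
    phrases = []
    for seq in sequences:
        phrases.extend(p.strip() for p in seq.split(sep) if p.strip())
    return phrases
```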

You are right that it takes much longer to generate. Greedy decoding is fast but only returns a small number of phrases, which is not directly comparable with many extractive models and may not be useful in certain applications (say, retrieval). The performance gain from a larger beam size is also discussed in our recent paper, Figure 4. It is a pain, and I believe a better/more efficient decoding strategy would be very helpful.

In practice, I generally use the first 2k examples of kp20k-valid (20k data examples) for validation (dubbed kp20k-valid2k) to select the best-performing checkpoint and then generate results on other test sets with this single checkpoint, rather than with all checkpoints.

Feel free to let me know if you have more questions. Thanks, Rui

wasiahmad commented 3 years ago

A quick question: why do you use only 2k examples? Just to save time? Do you use beam decoding to select the best checkpoint too?

memray commented 3 years ago

Yes, basically just to save time. We select the best checkpoint for greedy and beam search (and for different beam sizes) separately. That is, we run greedy decoding on valid2k to find the best greedy checkpoint and run beam search on valid2k to find the best beam-search checkpoint.

wasiahmad commented 3 years ago

Why do you prefer to generate 500 keyphrases? Isn't that too many? I wonder whether 500 keyphrases really make sense for a scientific article abstract.

A follow-up question: let's say you generate 500 keyphrases, how do you pick 50 to compute R@50? Do you also consider the likelihood of each keyphrase from beam search? It might be difficult to measure individual likelihoods in the One2Seq approach.

memray commented 3 years ago

My initial intention was just to save all the output phrases, so I use 500. It turns out One2One can generate more than 500 phrases, but One2Seq models generate far fewer than 500 unique phrases. Note that only around 30 phrases are present and the rest are absent (the model is expected to generate whatever phrases might be relevant, since absent phrases are very difficult to predict).

I sort all the phrases by their likelihood and evaluate with this ranked phrase list (say, the top 50). Yes, for One2Seq it's a bit tricky since phrases in the same sequence share the same score, so we just follow their order in the sequence (the earlier a phrase appears in the sequence, the higher its rank; see also Ye and Wang, 2018).
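A sketch of that ranking step for One2Seq outputs (the names are illustrative): every phrase carries its sequence's log-likelihood, and ties within a sequence are broken by the phrase's position.

```python
def rank_one2seq_phrases(beams):
    # beams: list of (sequence_score, [phrase1, phrase2, ...]) from beam search,
    # where every phrase in a sequence shares that sequence's score.
    candidates = []
    for score, phrases in beams:
        for pos, phrase in enumerate(phrases):
            candidates.append((score, pos, phrase))
    # Higher sequence score first; within a sequence, earlier phrases rank higher.
    candidates.sort(key=lambda x: (-x[0], x[1]))
    return [phrase for _, _, phrase in candidates]
```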

wasiahmad commented 3 years ago

Thanks for sharing all the details.

I was comparing the results of CatSeq in your ACL 2020 paper and Chan et al., 2019. I see that for present keyphrases, the F1@5 reported in your paper is 31.4, while Chan et al. report 22.5. Both works used OpenNMT. I am kind of confused about the reason behind this vast difference. Is it due to beam decoding? AFAIK, Chan et al. used greedy decoding. Did you check how large the performance difference is if you use greedy decoding?

Also, I know people do slightly different things when post-processing the keyphrases before evaluation. I am interested to know if there is any key difference between your ACL 2020 paper and Chan et al., 2019 in terms of the evaluation setting.

I reproduced the results of Chan et al., 2019 using their public code and used their evaluation script in my work. I also see a comparatively lower F1@5 score than your work's result. So, I am curious to know what the crucial difference in the evaluation is.

Lastly, which one is your evaluation script? Let's say I give you three files: source text, targets, and predictions. Is there a script that can give me the final results?

Thanks in advance!

memray commented 3 years ago

> the F1@5 reported in your paper is 31.4

If you mean the scores in Table 2, yes, they are all beam-search results (beam_size=50). In Table 4 we report some results using greedy decoding. I think Chan et al. only presented greedy-decoding scores. The basic processing is the same as in previous studies, except that: (1) we manually cleaned up KP20k to correct many problematic data points (i.e., no/erroneous keyphrases); (2) we resplit the Krapivin valid/test set, so this dataset may not be consistent with others. But its source is also ACM author keywords (a major part of KP20k is also ACM), so comparing results on KP20k is more indicative.

> post-processing

We did realize there might be discrepancies in post-processing so we include our script for separating present/absent phrases in this notebook.
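For reference, present/absent separation is commonly done along these lines (a sketch assuming stemming-based contiguous matching against the source text; the notebook above is the authoritative version of what we actually do):

```python
def split_present_absent(source_tokens, phrases, stem):
    # A phrase is "present" if its stemmed token sequence appears contiguously
    # in the stemmed source text; otherwise it is "absent".
    # `stem` is any per-token stemming function (e.g. a Porter stemmer's stem method).
    src = [stem(t) for t in source_tokens]
    present, absent = [], []
    for phrase in phrases:
        toks = [stem(t) for t in phrase.split()]
        n = len(toks)
        found = any(src[i:i + n] == toks for i in range(len(src) - n + 1))
        (present if found else absent).append(phrase)
    return present, absent
```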

> evaluation script

My evaluation function is kp_evaluate.py, L51.

TBH, I don't know the exact reason for the difference between our results and Chan's. Please let me know if you figure out the cause.

Thanks, Rui