memray / OpenNMT-kpg-release

Keyphrase Generation
MIT License

About the statistics of present/absent keyphrase #21

Closed · thinkwee closed this issue 3 years ago

thinkwee commented 3 years ago

Hi~ I see that in your paper Deep Keyphrase Generation you give the proportions of present and absent keyphrases in four public datasets (screenshot of the table attached). This result is different from the table in the README. I also calculated the proportions based on the data you provided and got yet another number. I want to know how to get the correct result and how you define present/absent keyphrases (only on the test set? only present in the abstract? after stemming?). Thank you!

memray commented 3 years ago

Thank you for the question and sorry for the confusion. The stats in the README are more up-to-date and reliable, since we further cleaned the data after the first paper. We know that the way of separating present/absent keyphrases may affect the final scores, so we provide updated statistics in the README and the latest papers.

For determining present/absent phrases, the method remains the same (tokenization, digit replacement, word matching, etc.), and you can check out this notebook, which provides the complete pipeline for this purpose.

Replies to your specific questions:
1. The numbers reported in the README are on the test set only.
2. Only in abstracts (though NUS and SemEval have full text).
3. After stemming and lowercasing.
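For reference, here is a minimal sketch of that present/absent split (the helper names `normalize` and `is_present` are mine, and NLTK's PorterStemmer plus a simple regex tokenizer are assumptions; the linked notebook is the authoritative pipeline):

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def normalize(text):
    # lowercase, replace digit runs with a placeholder, tokenize, then stem
    text = re.sub(r'\d+', '<digit>', text.lower())
    tokens = re.findall(r'<digit>|[a-z]+', text)
    return [stemmer.stem(tok) for tok in tokens]

def is_present(src_text, keyphrase):
    # "present" = the stemmed keyphrase tokens appear contiguously in the stemmed source
    src, kp = normalize(src_text), normalize(keyphrase)
    return any(src[i:i + len(kp)] == kp for i in range(len(src) - len(kp) + 1))

# only the abstract is used as src_text, even for datasets with full text
print(is_present("We study neural keyphrase generation models.", "keyphrase generation"))   # True
print(is_present("We study neural keyphrase generation models.", "information retrieval"))  # False
```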

Thanks, Rui

qute012 commented 3 years ago

@memray

Best regards! I think stemming just inflates the F1 score rather than reflecting the real keyphrases, and the stemmed sentences end up syntactically incorrect. What do you think about this?

Thanks.

memray commented 3 years ago

Hi @qute012 ,

Yes, that's true. But there are many trivial form variants that cause phrases to fail to match, so I think the benefits outweigh the downsides. Also, generation models are usually capable of predicting well-formed phrases (much better than extractors based on POS tags). So stemming is pretty useful in evaluation, for now.

Best, Rui

qute012 commented 3 years ago

@memray

Thanks for the kind words!

In my case, I'm using an extractive model on top of a pretrained language model. In that setting, I guess it's important to feed syntactically intact sentences to a pretrained language model such as BERT, but I agree it's a better idea with a generative model. Is this a valid point?

memray commented 3 years ago

@qute012 Oh, I think I got your point. Do you mean stemming the input sentences? No, the input to the model is not stemmed; stemming is only applied during evaluation. For example, given the ground-truth keyphrases ['computers', 'calculations'], it is fine for the model to generate 'computer', 'compute', 'comput', 'calculat', etc.; they will all be treated as correct predictions since they match the ground-truth phrases after stemming. Duplicate predictions, however, are ignored in the evaluation, see here.
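Roughly, the evaluation-time matching looks like the sketch below (hypothetical helper names, PorterStemmer assumed as the stemmer; the actual logic lives in the linked evaluation code):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(phrase):
    # lowercase and stem every token of a phrase
    return ' '.join(stemmer.stem(tok) for tok in phrase.lower().split())

def count_correct_at_k(predictions, ground_truth, k=5):
    gold = {stem_phrase(p) for p in ground_truth}
    seen, correct = set(), 0
    for pred in predictions[:k]:
        s = stem_phrase(pred)
        if s in seen:
            # duplicate predictions (identical after stemming) are ignored
            continue
        seen.add(s)
        if s in gold:
            correct += 1
    return correct

# 'computer' and 'computing' both stem to 'comput', so the second is treated
# as a duplicate; 'calculation' matches 'calculations' after stemming.
print(count_correct_at_k(['computer', 'computing', 'calculation'],
                         ['computers', 'calculations']))  # -> 2
```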

qute012 commented 3 years ago

Oh, @memray, I got it! It's just for evaluation. Finally, I understand why this is helpful.

Thanks 👍