about data source - Githubissues

memray / OpenNMT-kpg-release

Keyphrase Generation

MIT License

217 stars 34 forks

about data source #19

Closed johncs999 closed 4 years ago

johncs999 commented 4 years ago

Hi memray, can you give more details about the data sources (e.g. which websites do the abstracts in kp20k/semeval/... come from)? I find that the results on some test datasets (e.g. semeval, inspec) are noticeably worse than on others, so I suspect there may be differences in data distribution. What do you think about this?

memray commented 4 years ago

KP20k was collected by me and my colleagues from several sources, e.g. ACM, Wiley, ScienceDirect, Elsevier. All keywords were provided by the original authors.

As for test datasets such as semeval/nus/inspec/krapivin, they come from previous studies (say this repo), and most of them (except krapivin, which also uses author keywords) were additionally annotated. So models perform almost the same on KP20k and krapivin, but not on the others.

Hope this helps.