memray / seq2seq-keyphrase

MIT License

Some questions about kp20k dataset #29

Closed by kugwzk 5 years ago

kugwzk commented 5 years ago

Hi @memray~ Thanks for releasing the kp20k dataset. After reading the paper, I still have some questions about it.

  1. The paper says the training set has 527830 articles, and the validation and test sets have 20000 each. But your released dataset has 530809 articles. Did you get to 527830 by removing the articles that also appear in other test sets such as Inspec and SemEval?
  2. The paper also says you did not use the whole training set and trained only on the validation set (I may have misunderstood this part). Could you explain that?
  3. I am working on keyphrase extraction, not generation. Do you think I can use this dataset if I remove the absent keyphrases?

memray commented 5 years ago

Hi @kugwzk,

  1. The actual number of examples used for training may be smaller because I filter them on the fly; it is around 500k. Yes, papers in the test sets are removed.
  2. Sorry for the confusion. We use the small validation set only to train the two supervised baseline models (Maui and KEA). The deep learning models are trained on the 500k papers.
  3. Yes, it should work.
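
For point 3, here is a minimal sketch of what "removing absent keyphrases" could look like. It is not code from this repo. It assumes a keyphrase counts as present when its lowercased tokens appear contiguously in the source text; no stemming is applied, which is a simplification. The function names `tokenize`, `is_present` and `keep_present_keyphrases` are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_present(phrase_tokens, text_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in text_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        text_tokens[i:i + n] == phrase_tokens
        for i in range(len(text_tokens) - n + 1)
    )


def keep_present_keyphrases(text, keyphrases):
    """Keep only the keyphrases that literally appear in the text (assumed criterion)."""
    text_tokens = tokenize(text)
    return [kp for kp in keyphrases if is_present(tokenize(kp), text_tokens)]
```

For example, with the text "We study neural keyphrase generation with an encoder-decoder model." and the keyphrases "keyphrase generation", "deep learning" and "encoder-decoder", only the first and the last are kept. Papers left with no present keyphrase would probably need to be dropped as well.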

kugwzk commented 5 years ago

Sorry, I am still confused about answer 1. Do you filter the training set before training starts or during training? And could you explain how you filter it? Thanks a lot~

memray commented 5 years ago

I did both: (1) papers in the test sets are excluded beforehand; (2) some noisy data is removed on the fly (examples with no keywords, or with text that is too long).
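
The two steps could be sketched roughly as below. This is an illustration under assumptions, not the actual preprocessing code: it assumes each record is a dict with `title`, `abstract` and a semicolon-separated `keyword` string, that test-set overlap is detected by normalized title, and that "too long" means more than `max_tokens` whitespace tokens. The threshold of 500 and all function names are hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def exclude_test_papers(train_records, test_records):
    """Step 1 (beforehand): drop training papers whose title appears in a test set."""
    test_titles = {normalize_title(r["title"]) for r in test_records}
    return [
        r for r in train_records
        if normalize_title(r["title"]) not in test_titles
    ]


def iter_clean(records, max_tokens=500):
    """Step 2 (on the fly): skip examples with no keywords or overly long text."""
    for r in records:
        keywords = [k.strip() for k in r.get("keyword", "").split(";") if k.strip()]
        text = (r.get("title", "") + " " + r.get("abstract", "")).strip()
        if not keywords:
            continue  # no keyword
        if len(text.split()) > max_tokens:
            continue  # text is too long (threshold is an assumption)
        yield r
```

A training loop would then iterate over `iter_clean(exclude_test_papers(train, test))`, which is why the number of examples actually seen can be lower than the number of records in the released file.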

kugwzk commented 5 years ago

Thanks a lot~