memray / seq2seq-keyphrase


Training set question #8

Closed lihy-1224 closed 7 years ago

lihy-1224 commented 7 years ago

Hi, could you tell me what each file in 'punctuation-20000validation-20000testing' is for? I'm new to NLP and don't quite follow. Is all_600k_dataset.pkl all the raw data? If so, how can I use files in another format directly as the training set, e.g. the JSON format of the kp20k dataset?

memray commented 7 years ago

all_600k_dataset.pkl contains all the data examples of kp20k: around 530k examples for training, 20k for validation, and 20k for testing.

validation_id_20000.pkl and testing_id_20000.pkl hold the indexes of the validation and testing data within all_600k_dataset.pkl, respectively. Each file is very small, as it contains only an array of indexes.

all_600k_voc.json and all_600k_voc.pkl list all the words in the vocabulary, sorted by frequency. As I recall, they are not used by the model.

inspec.testing.pkl is a small testing dataset for quick evaluation during training.
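For illustration, a minimal sketch of how these files fit together; the pickle layout (a flat list of examples) is an assumption here, so check keyphrase_dataset.py for the real structure:

```python
import pickle

# Load the full kp20k data and the held-out index lists.
with open('all_600k_dataset.pkl', 'rb') as f:
    all_examples = pickle.load(f)        # assumed: a flat list of examples
with open('validation_id_20000.pkl', 'rb') as f:
    valid_ids = pickle.load(f)           # ~20k integer indexes
with open('testing_id_20000.pkl', 'rb') as f:
    test_ids = pickle.load(f)            # ~20k integer indexes

held_out = set(valid_ids) | set(test_ids)
valid_set = [all_examples[i] for i in valid_ids]
test_set  = [all_examples[i] for i in test_ids]
train_set = [ex for i, ex in enumerate(all_examples) if i not in held_out]
```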

lihy-1224 commented 7 years ago

OK, I got it! But if I want to use my own data as the training dataset, do I have to use "/keyphrase/dataset/keyphrase_dataset.py" to generate the pickled data?

memray commented 7 years ago

Yes. I think the two lines below matter most, though I've commented them out in the code; the other lines are trivial, mostly exporting things for data analysis:

```python
train_set, test_set, idx2word, word2idx = load_data_and_dict(config['training_dataset'], config['testing_dataset'])
serialize_to_file([train_set, test_set, idx2word, word2idx], config['dataset'])
```
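If your data is kp20k-style JSON (one object per line with 'title', 'abstract' and 'keyword'), a rough sketch of converting it into a pickled record list might look like this; the field handling and file names here are assumptions, not the repo's code:

```python
import json
import pickle

# Hypothetical converter from kp20k-style JSON lines to a pickled record list.
records = []
with open('kp20k_training.json') as f:       # assumed input file name
    for line in f:
        ex = json.loads(line)
        records.append({
            'title':    ex.get('title', ''),
            'abstract': ex['abstract'],
            'keyword':  ex['keyword'],       # e.g. 'kp one;kp two;kp three'
        })

with open('my_training_dataset.pkl', 'wb') as f:
    pickle.dump(records, f)
```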

lihy-1224 commented 7 years ago

Thanks, I understood it after reading your code and paper carefully. But one thing that still confuses me is the role of 'title' in your dataset, whose records contain 'abstract', 'title' and 'keyword'. Can I use a dataset without 'title'?

memray commented 7 years ago

Sure, I think you can leave the title blank, because I just concatenate the title and abstract to obtain the source text. You could also customize your own preprocessing.
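For illustration, the concatenation could look like this (a sketch; the separator is an assumption):

```python
def build_source_text(example):
    # An empty 'title' is fine: the abstract alone becomes the source text.
    title = (example.get('title') or '').strip()
    abstract = example['abstract'].strip()
    return title + ' . ' + abstract if title else abstract
```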

lihy-1224 commented 7 years ago

OK👌. May I ask another question? I think config['testing_datasets'] specifies the datasets for keyphrase extraction, but what is config['testing_dataset'] used for? Besides, I found that the text in the dataset/baseline-data/ folder is preprocessed with nltk.pos_tag(), which doesn't seem to work for Chinese. So I'd like to know whether this project can be used for Chinese keyphrase extraction, or whether some of the code needs to be modified.

memray commented 7 years ago

config['testing_dataset'] is not very useful; it is only used for some quick testing during training. And you are right about the purpose of config['testing_datasets'].
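For illustration, the two keys might be set like this (the benchmark names and value types here are assumptions, not the repo's exact defaults):

```python
# one small set for quick sanity checks during training
config['testing_dataset'] = 'inspec.testing.pkl'
# the benchmark datasets that keyphrases are actually extracted from
config['testing_datasets'] = ['inspec', 'nus', 'semeval', 'krapivin', 'kp20k']
```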

I didn't try Chinese, so I can't help much there. If you want to apply my English model to Chinese data directly, I don't think it will work. If you mean retraining the model on Chinese data, I see no problem; the only concern is the preprocessing, where you may have to change the tokenization module, etc. I think nltk.pos_tag() is only used in the evaluation part.
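For example, here is a minimal sketch of a Chinese tokenizer based on jieba that could stand in for the English word tokenizer (where exactly to hook it into the preprocessing is up to you; this is not the repo's code):

```python
import jieba  # pip install jieba

def tokenize_zh(text):
    """Segment Chinese text into word tokens, dropping pure whitespace."""
    return [tok for tok in jieba.lcut(text) if tok.strip()]

print(tokenize_zh('基于序列到序列模型的关键词生成'))
# e.g. ['基于', '序列', '到', '序列', '模型', '的', '关键词', '生成']
```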

lihy-1224 commented 7 years ago

Yes, I have followed your instructions to retrain the model and generate the training and testing datasets.

But when I set config['do_predict'] = True, I couldn't see any extracted phrases in dataset/keyphrase/prediction/. Only when I set config['do_evaluate'] = True is the result in dataset/keyphrase/prediction/ created.

So could you tell me which part of your code writes the extracted phrases into dataset/keyphrase/prediction/? I couldn't find it 😓.

Sorry to bother you again.

memray commented 7 years ago

That's right, do_predict doesn't export the final keyphrases. During prediction the (deep) model generates decode_logits (indexes of words) from the given input text, which are not yet human-readable strings. In the next step, do_evaluate, a post-processing step converts these logits into strings. The exporting code is in keyphrase_utils.py, lines 359 to 367.
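For illustration, a minimal sketch of that post-processing, assuming decode_logits is a list of word-index sequences and idx2word maps indexes back to tokens (the end-of-sequence id is an assumption):

```python
# Hypothetical post-processing: turn predicted word indexes into phrases.
EOS_ID = 0  # assumed id of the end-of-sequence token

def logits_to_phrases(decode_logits, idx2word):
    phrases = []
    for seq in decode_logits:        # one predicted keyphrase per sequence
        words = []
        for idx in seq:
            if idx == EOS_ID:        # stop at end-of-sequence
                break
            words.append(idx2word[idx])
        phrases.append(' '.join(words))
    return phrases

# Example: with idx2word = {1: 'neural', 2: 'network'},
# logits_to_phrases([[1, 2, 0]], idx2word) -> ['neural network']
```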