all_600k_dataset.pkl contains all the data examples of kp20k: around 530k examples for training, 20k for validation and 20k for testing.
validation_id_20000.pkl and testing_id_20000.pkl indicate the indexes of the validation data and the testing data in all_600k_dataset.pkl, respectively. Note that each of these files is very small, as it only contains an array of indexes.
all_600k_voc.json and all_600k_voc.pkl contain all the words in the vocabulary, sorted by frequency. I remember they are not used by the model.
inspec.testing.pkl is a small testing dataset for simple evaluation during training.
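(Not from the repo, just for orientation: a minimal sketch of how these files could be loaded and combined, assuming the dataset pickle is a list of example dicts and the id files hold arrays of integer indexes.)

```python
import pickle

# Assumed structure: all_600k_dataset.pkl is a list of example dicts, and the
# two id files each hold an array of integer positions into that list.
with open('all_600k_dataset.pkl', 'rb') as f:
    all_examples = pickle.load(f)
with open('validation_id_20000.pkl', 'rb') as f:
    valid_ids = set(pickle.load(f))
with open('testing_id_20000.pkl', 'rb') as f:
    test_ids = set(pickle.load(f))

valid_set = [all_examples[i] for i in sorted(valid_ids)]
test_set = [all_examples[i] for i in sorted(test_ids)]
train_set = [ex for i, ex in enumerate(all_examples)
             if i not in valid_ids and i not in test_ids]
```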
OK, I got it! But if I want to use my own data as the training dataset, do I have to use "/keyphrase/dataset/keyphrase_dataset.py" to generate the pickled data?
Yes, I think the two lines below matter most; I've commented on them in the code. The other lines are trivial, mostly about exporting things for data analysis:
train_set, test_set, idx2word, word2idx = load_data_and_dict(config['training_dataset'], config['testing_dataset'])
serialize_to_file([train_set, test_set, idx2word, word2idx], config['dataset'])
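(If it helps, here is a rough sketch of what that step looks like for a custom corpus. The two calls mirror the lines above; the import path, the config values, and the file locations are illustrative assumptions, not necessarily what the repo uses.)

```python
# Hypothetical import path and paths for a custom corpus; adjust to your setup.
from keyphrase.dataset.keyphrase_dataset import load_data_and_dict, serialize_to_file

config = {
    'training_dataset': 'dataset/keyphrase/my_corpus/training.txt',
    'testing_dataset': 'dataset/keyphrase/my_corpus/testing.txt',
    'dataset': 'dataset/keyphrase/my_corpus/my_corpus_dataset.pkl',
}

# Build the training/testing sets and the vocabulary mappings, then pickle them
# into the single dataset file that training later loads.
train_set, test_set, idx2word, word2idx = load_data_and_dict(
    config['training_dataset'], config['testing_dataset'])
serialize_to_file([train_set, test_set, idx2word, word2idx], config['dataset'])
```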
Thanks, I understood it after reading your code and paper carefully. But one thing that still confuses me is the role of 'title' in your dataset, which contains 'abstract', 'title' and 'keyword'. Can I use a dataset without 'title'?
Sure, I think you can leave the title blank, because I concatenate the title and abstract to obtain the source text. You could customize your own preprocessing as well.
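(A minimal sketch of that concatenation, just to illustrate; the helper and the separator between title and abstract are my own assumptions, not code from the repo.)

```python
def build_source_text(example):
    # 'title' and 'abstract' are the record fields; an empty title simply
    # contributes nothing to the concatenated source text.
    title = (example.get('title') or '').strip()
    abstract = (example.get('abstract') or '').strip()
    return (title + ' . ' + abstract) if title else abstract

example = {'title': '',
           'abstract': 'We study sequence-to-sequence models for keyphrase generation.',
           'keyword': 'keyphrase generation;seq2seq'}
print(build_source_text(example))  # falls back to the abstract alone
```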
OK👌. May I ask another question? I think config['testing_datasets'] indicates the datasets for keyphrase extraction, but what about config['testing_dataset'], what is it used for? Also, I found that the text in the dataset/baseline-data/ folder is preprocessed with nltk.pos_tag(), which doesn't seem to work for Chinese. So I'd like to know whether this project can be used for Chinese keyphrase extraction, or whether some of the code needs to be modified.
config['testing_dataset'] is not very useful; it is only used for some quick testing during training. And you are right about the purpose of config['testing_datasets'].
I haven't tried Chinese, so I can't help much there. If you want to apply my English model to Chinese data directly, I don't think it will work. If you mean retraining the model on Chinese data, I see no problem. The only concern is the preprocessing part, where you may have to change the tokenization module, etc. I think nltk.pos_tag() is only used in the evaluation part.
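(One possible way to swap in a Chinese tokenizer, purely as a sketch; jieba is my suggestion here and is not used anywhere in this repo.)

```python
import re

import jieba  # pip install jieba

def tokenize_zh(text):
    # Segment Chinese text into word tokens, dropping whitespace-only pieces.
    return [tok for tok in jieba.lcut(text) if tok.strip()]

def tokenize_en(text):
    # Rough stand-in for the English word tokenization used in preprocessing.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize_zh('自动关键词抽取是自然语言处理的一项基础任务'))
print(tokenize_en('Automatic keyphrase extraction is a basic NLP task.'))
```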
Yes, I have followed your instructions to retrain the model and generate the training and testing datasets.
But when I set config['do_predict'] = True, I couldn't see any extracted phrases in dataset/keyphrase/prediction/. Only when I set config['do_evaluate'] = True does the result in dataset/keyphrase/prediction/ get created.
So could you tell me which part of your code exports the extracted phrases into dataset/keyphrase/prediction/? I couldn't find it😓.
Sorry to bother you again.
It's true that do_predict doesn't export the final keyphrases. That's because during prediction the (deep) model generates decode_logits (indexes of words) for the given input text, which are not yet readable strings. In the next step, do_evaluate, a post-processing step converts these logits into strings, and the exporting code is located in keyphrase_utils.py, lines 359 to 367.
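(Conceptually the conversion is just an index-to-word lookup; here is a simplified sketch, with illustrative names, of what that post-processing amounts to. The real code in keyphrase_utils.py around lines 359-367 does more than this.)

```python
def logits_to_phrases(decode_logits, idx2word, eos_idx=0):
    """Turn predicted word-index sequences into readable keyphrase strings.

    Assumptions: decode_logits is a list of index sequences (one per predicted
    keyphrase), idx2word is the index-to-word dict built during preprocessing,
    and eos_idx marks the end of a sequence.
    """
    phrases = []
    for seq in decode_logits:
        words = []
        for i in seq:
            if i == eos_idx:
                break
            words.append(idx2word.get(i, '<unk>'))
        if words:
            phrases.append(' '.join(words))
    return phrases

# e.g. idx2word = {1: 'neural', 2: 'network'}; logits_to_phrases([[1, 2, 0]], idx2word)
# returns ['neural network'].
```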
Hi, could you explain what each of the files under 'punctuation-20000validation-20000testing' is for? I'm new to NLP and a bit confused. Is all_600k_dataset.pkl all of the raw data? If so, how can I directly use files in another format as the training set, for example the JSON format of the kp20k dataset?