question about building dataset

memray / seq2seq-keyphrase

MIT License

question about building dataset #14

Closed leexian closed 6 years ago

leexian commented 6 years ago

I'm using the following piece of code to build my own dataset, but I noticed that it ignores OOV words. What's the correct way to build a dataset that enables the "copy-mode"?

Moreover, how does cc_matrix work? It seems that "cc[k][j][i]=1" marks the copied words in the vocabulary.

memray commented 6 years ago

Yes, both A and B are filtered. The copy training is mostly driven by cc_matrix, which tells the model which words can be copied from the source.
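For readers following along, here is a minimal sketch of how a copy matrix with the cc[k][j][i] layout from the question could be built. It assumes k indexes the example, j the target position and i the source position; the function name build_cc_matrix and the toy indices are made up for illustration and are not taken from the repository.

```python
def build_cc_matrix(sources, targets):
    """Return cc where cc[k][j][i] == 1 iff target token j of example k
    is the same word index as source token i (i.e. it is copyable)."""
    max_src = max(len(s) for s in sources)
    max_tgt = max(len(t) for t in targets)
    # One (max_tgt x max_src) grid of zeros per example.
    cc = [[[0] * max_src for _ in range(max_tgt)] for _ in sources]
    for k, (src, tgt) in enumerate(zip(sources, targets)):
        for j, t in enumerate(tgt):
            for i, s in enumerate(src):
                if t == s:
                    cc[k][j][i] = 1
    return cc


# Toy batch with a single example (word indices are hypothetical).
sources = [[5, 8, 3, 8]]
targets = [[8, 9]]
cc = build_cc_matrix(sources, targets)
print(cc[0][0])  # target word 8 matches source positions 1 and 3
print(cc[0][1])  # target word 9 does not appear in the source
```

Each row of cc[k] is a mask over source positions, so a target word that appears nowhere in the source gets an all-zero row and cannot be learned through copying.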

leexian commented 6 years ago

I'm still confused. cc_matrix is built from A and B, so the copyable words that cc_matrix marks do not include OOV words. The training inputs (A, B and cc_matrix) carry no information about the OOV words. Do you mean that the copy training is done only on words in the vocabulary?

memray commented 6 years ago

Yes, the original CopyNet implementation works the same way. I think that even when trained only on the non-OOV words, the model still learns well what to copy. If you want to take the OOV words into account during training, you have to create a temporary dict for each data example to index its OOV words. I did this in my PyTorch implementation and you can check it out (not perfect but runnable).
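A rough sketch of the "temporary dict per example" idea, in case it helps. This is not the code from the PyTorch implementation mentioned above; the function name encode_with_temp_oov, the "<unk>" entry and the assumption that word2idx uses contiguous ids 0..len(word2idx)-1 are all illustrative.

```python
def encode_with_temp_oov(source_words, target_words, word2idx, unk_idx):
    """Encode one example, giving each OOV source word a temporary id
    just past the end of the fixed vocabulary (assumes contiguous ids)."""
    vocab_size = len(word2idx)
    oov2idx = {}  # temporary dict, valid for this example only
    src_ext = []
    for w in source_words:
        if w in word2idx:
            src_ext.append(word2idx[w])
        else:
            if w not in oov2idx:
                oov2idx[w] = vocab_size + len(oov2idx)
            src_ext.append(oov2idx[w])
    # A target OOV is recoverable only if it also occurs in the source;
    # otherwise it falls back to the unknown-word index.
    tgt_ext = [word2idx.get(w, oov2idx.get(w, unk_idx)) for w in target_words]
    return src_ext, tgt_ext, oov2idx


word2idx = {"<pad>": 0, "<unk>": 1, "neural": 2, "network": 3}
src_ext, tgt_ext, oov2idx = encode_with_temp_oov(
    ["neural", "copynet", "network", "copynet"],
    ["copynet", "network", "seq2seq"],
    word2idx,
    unk_idx=word2idx["<unk>"],
)
print(src_ext, tgt_ext, oov2idx)
```

Here "copynet" gets the temporary id 4 in both source and target, so it can be aligned for copying, while "seq2seq" never appears in the source and stays unknown.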

leexian commented 6 years ago

Thanks for your explanation. It sounds like that would work. I'll try your PyTorch implementation. You have done a great job!

leexian commented 6 years ago

I reread the code and found that I had misunderstood that piece of code. A and B contain all the words that appear in the original text, and the words beyond 'voc_size' are filtered out before training. What a mistake. Sorry!

memray commented 6 years ago

Ooops, actually I thought you meant this. Hope my answer is not misleading and is a little bit helpful :)

memray commented 6 years ago

@leexian I happened to reread this part of the code and realized that I made a mistake in my previous answer.

    A = [word2idx[w] if w in word2idx else word2idx[''] for w in source]
    B = [[word2idx[w] if w in word2idx else word2idx[''] for w in p] for p in target]

Here we don't actually filter any OOV words out of A and B, because word2idx contains all the words. I wrote it this way (w if w in word2idx, else the word2idx[''] entry) only in case new data contains an OOV word.

The real filtering happens on lines 371 and 374 of keyphrase_copynet.py, in unk_filter().

Therefore, the alignments of OOVs are in cc_matrix and will be learned by the model.
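To make the order of operations concrete, here is a small sketch of the idea: alignments are computed on the unfiltered indices, and only afterwards are indices at or beyond voc_size replaced. The body of unk_filter below is a guess at the behaviour described in this thread, not a copy of keyphrase_copynet.py, and unk_idx=1 plus the toy vocabulary are assumptions.

```python
def unk_filter(indices, voc_size, unk_idx=1):
    """Replace every index >= voc_size with the unknown-word index.
    (Hypothetical re-implementation; unk_idx=1 is an assumption.)"""
    return [i if i < voc_size else unk_idx for i in indices]


# word2idx covers ALL words in the data; only the first voc_size
# indices are kept by the model (toy values).
word2idx = {"<pad>": 0, "<unk>": 1, "the": 2, "model": 3, "copynet": 4}
voc_size = 4

A = [word2idx[w] for w in ["the", "copynet", "model"]]  # source
B = [word2idx[w] for w in ["copynet"]]                  # one target phrase

# Alignment row for the single target word, built BEFORE filtering,
# so the OOV word "copynet" is still matched to source position 1.
cc_row = [1 if B[0] == a else 0 for a in A]

# Inputs actually fed to the model, AFTER filtering.
A_filtered = unk_filter(A, voc_size)
B_filtered = unk_filter(B, voc_size)
print(cc_row, A_filtered, B_filtered)
```

The filtered sequences no longer distinguish "copynet" from any other rare word, but the alignment row still marks where it sits in the source.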