Yes, both A and B are filtered. The copy training is mostly driven by cc_matrix, which indicates which words can be copied from the source.
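Concretely, the idea behind cc_matrix is roughly the following (a sketch with my own function name, not the exact code from the repo; treating index 0 as padding is an assumption). Here k indexes the example in the batch, j the target position, and i the source position:

```python
import numpy as np

def build_cc_matrix(source, target, pad_idx=0):
    # cc[k][j][i] = 1 whenever target word j of example k also occurs at source
    # position i, i.e. it marks every source position that word could be copied from.
    batch, tgt_len, src_len = len(target), len(target[0]), len(source[0])
    cc = np.zeros((batch, tgt_len, src_len), dtype='float32')
    for k in range(batch):
        for j in range(tgt_len):
            for i in range(src_len):
                if target[k][j] == source[k][i] and source[k][i] != pad_idx:
                    cc[k][j][i] = 1.0
    return cc
```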
I'm still confused. cc_matrix is built from A and B, so the copyable words that cc_matrix marks do not include OOV words. The inputs for training (A, B, and cc_matrix) carry no information about the OOV words. Do you mean the copy training is done only on in-vocabulary words?
Yes, the original CopyNet implementation also works the same way. I think that even when trained only on the non-OOV words, the model still learns well what to copy. If you want to take the OOV words into account during training, you have to create a temporary dict for each data sample to index these OOV words. I did this in my PyTorch implementation and you can check it out (not perfect, but runnable).
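As a rough illustration of such a per-example dict (the names below are my own, not the exact code of the PyTorch implementation, and unk_idx=1 is just an assumption): OOV source words get temporary ids placed right after the fixed vocabulary, so the copy part of the model can still point at them.

```python
def index_source_with_oov(src_tokens, word2idx, vocab_size, unk_idx=1):
    """Sketch: return (a) ids clipped to the fixed vocab for the generator and
    (b) ids in an extended vocab where this example's OOV words get
    temporary ids >= vocab_size for the copy mechanism."""
    oov_dict = {}            # word -> temporary id, valid only for this example
    src_ids, src_ext_ids = [], []
    for w in src_tokens:
        if w in word2idx and word2idx[w] < vocab_size:
            src_ids.append(word2idx[w])
            src_ext_ids.append(word2idx[w])
        else:
            src_ids.append(unk_idx)                       # generator side sees <unk>
            if w not in oov_dict:
                oov_dict[w] = vocab_size + len(oov_dict)  # copy side keeps the word
            src_ext_ids.append(oov_dict[w])
    return src_ids, src_ext_ids, oov_dict
```

Target words that are OOV but appear in the source can then be mapped through the same oov_dict instead of being collapsed to <unk>.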
Thanks for your explanation. It sounds like it should work. I'll try your PyTorch implementation. You have done a great job!
I reread the code and found that I had misunderstood that piece of code. A and B contain all words that appear in the original text, and the words beyond 'voc_size' are filtered out before training. What a mistake. Sorry!
Oops, that is actually what I thought you meant. Hope my answer is not misleading and is a little bit helpful :)
@leexian I happened to read this part of the code again and realized that I made a mistake in my previous answer.
A = [word2idx[w] if w in word2idx else word2idx['<unk>'] for w in ...]
Here we actually don't filter out any OOV words from A and B, because word2idx contains all the words. I wrote it this way (w = w if w in word2idx else '<unk>') only as a fallback, since word2idx covers every word.
The real filtering happens at lines 371 and 374 of keyphrase_copynet.py, inside unk_filter().
Therefore, the alignments of OOV words are present in cc_matrix and will be learned by the model.
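Roughly, that filtering amounts to something like this (a simplified sketch of the idea, not the exact unk_filter() from keyphrase_copynet.py; putting <unk> at index 1 is an assumption):

```python
import numpy as np

def unk_filter(data, voc_size, unk_idx=1):
    # Sketch: A and B are built with the full word2idx, so every word keeps its
    # own index (and its alignment in cc_matrix). Only right before feeding the
    # embedding layer are indices >= voc_size replaced by the <unk> index.
    data = np.asarray(data)
    mask = (data < voc_size).astype(data.dtype)
    return data * mask + (1 - mask) * unk_idx
```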
I referred to the following piece of code to build my own dataset, but I noticed that this code ignores the OOV words. What is the correct way to build the dataset so that "copy mode" is enabled?
Moreover, how does cc_matrix work? It seems that cc[k][j][i] = 1 marks the copyable words that are in the vocabulary.