Open VieZhong opened 6 years ago
@VieZhong,
I am also looking for the same. Were you able to find a solution for it? I'll explain my problem a bit formally:
Let's say I have a vocabulary => ["hello", "I", "am", "akanimax"], my source sentence is => <"akanimax", "is", "a", "good", "boy">, and my target sentence is => <"akanimax", "not", "a", "good", "boy">. Then, while decoding "not" in the target, I have the following two questions:
1.) When the input to the encoder is "a", "is", "good", or "boy", what is actually fed to the encoder RNN? Is it the same embedding (representing the <copy> token) for all of them, or are they different, randomly initialized embeddings?
2.) When "not" needs to be output, we have no option other than emitting UNK, because it is neither in the source vocabulary χ nor in the generate vocabulary V. Is this correct?
I would be highly grateful if you could help.
Best regards, @akanimax
Hi @akanimax, I can't solve the OOV problem either. My answers to your two questions:
- The words the model doesn't recognize are all mapped to the same embedding token.
- Yes, that's correct.
I hope this helps. My English is not very good, sorry about that.
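To make the answer to question 1 concrete, here is a minimal sketch (hypothetical, not from this repo's code) of the standard behavior: every source word outside the vocabulary collapses to the same `<unk>` id, so all OOV words feed the encoder RNN the exact same embedding vector.

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table
# (one 8-dimensional vector per word id).
vocab = {"<unk>": 0, "hello": 1, "I": 2, "am": 3, "akanimax": 4}
embedding = np.random.rand(len(vocab), 8)

def encode_ids(tokens, vocab):
    """Map tokens to ids; all OOV tokens collapse to the single <unk> id."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

source = ["akanimax", "is", "a", "good", "boy"]
ids = encode_ids(source, vocab)
print(ids)  # everything except "akanimax" becomes id 0

vectors = embedding[ids]  # what the encoder RNN actually consumes
# "is" and "a" (both OOV) receive the identical <unk> embedding:
assert np.allclose(vectors[1], vectors[2])
```

So the OOV words are not given separate random embeddings per word; they all share one learned `<unk>` vector, which is exactly why the decoder cannot distinguish them without a copy mechanism.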
Hi, @akanimax, @VieZhong
I think the OOV problem can be solved by CopyNet here.
You see, the generate vocabulary (gen_vocab_size) can be kept small, while another, larger vocabulary that includes the OOV words is used for copying, and that one can change per example. Although in a real situation, we are probably unable to collect all tokens. (OOV means out-of-vocabulary word.)
I can't find any code that handles this problem; maybe I'm missing some important step?
Looking forward to your advice or answers.
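A minimal sketch of the per-example "extended vocabulary" idea described above (this is an assumption about how it could be wired up, in the style of pointer/copy models, not code from this repo): the generate vocabulary stays small, and each source sentence temporarily assigns ids to its own OOV words so the copy mechanism can emit them instead of UNK.

```python
# Small, fixed generate vocabulary.
gen_vocab = {"<unk>": 0, "a": 1, "good": 2, "boy": 3}

def build_extended_vocab(source_tokens, gen_vocab):
    """Assign temporary ids (after gen_vocab) to source-only OOV words."""
    extended = dict(gen_vocab)
    for tok in source_tokens:
        if tok not in extended:
            extended[tok] = len(extended)
    return extended

source = ["akanimax", "not", "a", "good", "boy"]
ext = build_extended_vocab(source, gen_vocab)
# "akanimax" and "not" now have copyable ids (4 and 5) for THIS example only.
# If the model predicts an id >= len(gen_vocab), it is resolved by looking
# the word up in the current source sentence rather than emitting <unk>.
id2tok = {i: t for t, i in ext.items()}
print(id2tok[5])  # copied from the source, no <unk> needed
```

The extended ids only exist on the output side (for the copy distribution and the loss); the encoder input still uses the shared `<unk>` embedding for these words, which matches the answer to question 1 above.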