lspvic / CopyNet

CopyNet Implementation with Tensorflow and nmt

How does the model handle the OOV problem? #11

Open VieZhong opened 6 years ago

VieZhong commented 6 years ago

OOV means an out-of-vocabulary word.

I can't find any code that handles this problem. Maybe I missed some important steps?

Looking forward to your advice or answers.

akanimax commented 5 years ago

@VieZhong,

I am also looking for the same. Were you able to find a solution for it? I'll explain my problem a bit formally:

Let's say I have a vocabulary of => ["hello", "I", "am", "akanimax"], my source statement is => <"akanimax", "is", "a", "good", "boy"> and my target statement is => <"akanimax", "not", "a", "good", "boy">. Then, while decoding the "not" in the target, I have the following two questions:

1.) When the input to the Encoder is "a" or "is" or "good" or "boy", what is actually sent to the Encoder RNN? Is it the same embedding representing the <copy> token, or are they different randomly initialized embeddings?

2.) When "not" needs to be output, we have no other option than calling it UNK because it is neither in chi nor in V. Is this correct?

I would be highly grateful if you could help.

Best regards, @akanimax

VieZhong commented 5 years ago

Hi @akanimax, I couldn't solve the OOV problem either. My answers to your two questions would be: 1) The words that the model doesn't recognize are all mapped to the same embedding token. 2) Yes, it is.

I hope this helps. My English is not very good, please excuse it, haha.
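
To make answer 1) concrete, here is a minimal sketch of what "the same embedding token" would mean for the example above. It is not taken from this repository: the `<unk>` entry, the `encode` helper and the id layout are assumptions made only for illustration.

```python
# Hypothetical vocabulary: the four words from the question plus an
# assumed "<unk>" entry at id 0 (the repo's real token name may differ).
vocab = ["<unk>", "hello", "I", "am", "akanimax"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}
UNK_ID = word_to_id["<unk>"]


def encode(tokens):
    """Map each token to its id; every unknown token shares UNK_ID."""
    return [word_to_id.get(token, UNK_ID) for token in tokens]


source = ["akanimax", "is", "a", "good", "boy"]
print(encode(source))  # [4, 0, 0, 0, 0]
```

Under this assumption "is", "a", "good" and "boy" all look up the same embedding row, so the encoder cannot tell them apart by their embedding alone.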

nlp4whp commented 5 years ago

Hi, @akanimax, @VieZhong

I think the OOV problem can be solved by CopyNet here. You see, the vocabulary used for generating (gen_vocab_size) can be small, while a second, larger vocabulary that includes the "OOV" words is used for copying, and that one can be changed.
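
A rough sketch of this two-vocabulary idea, applied to the example from earlier in the thread. It assumes the vocabulary is ordered so that the first `gen_vocab_size` ids are the generation words and all remaining ids can only be copied. That ordering, the `copy_vocab` list and the `output_modes` helper are hypothetical and are not confirmed by the code in this repository.

```python
# Hypothetical larger vocabulary used for copying; only the first
# gen_vocab_size entries are assumed to be available for generation.
copy_vocab = ["<unk>", "hello", "I", "am",
              "akanimax", "is", "a", "good", "boy", "not"]
gen_vocab_size = 4
word_to_id = {word: idx for idx, word in enumerate(copy_vocab)}


def output_modes(source_ids, target_ids, gen_vocab_size):
    """For each target id, say how the decoder could produce it."""
    source_set = set(source_ids)
    modes = []
    for tid in target_ids:
        can_generate = tid < gen_vocab_size   # inside the small vocabulary
        can_copy = tid in source_set          # present in the source sentence
        if can_generate and can_copy:
            modes.append("generate_or_copy")
        elif can_generate:
            modes.append("generate")
        elif can_copy:
            modes.append("copy")
        else:
            modes.append("unk")               # neither generable nor copyable
    return modes


source = [word_to_id[w] for w in ["akanimax", "is", "a", "good", "boy"]]
target = [word_to_id[w] for w in ["akanimax", "not", "a", "good", "boy"]]
print(output_modes(source, target, gen_vocab_size))
# ['copy', 'unk', 'copy', 'copy', 'copy']
```

In this sketch the words outside the small vocabulary have distinct ids, so they can be copied from the source, but "not" still ends up as UNK because it is neither a generation word nor present in the source sentence.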

Although in a real situation we are probably unable to collect all tokens