google / seq2seq

A general-purpose encoder-decoder framework for Tensorflow
https://google.github.io/seq2seq/
Apache License 2.0

How to prevent UNK from being generated at inference time? #203

Open nasrinm opened 7 years ago

nasrinm commented 7 years ago

For applications other than MT (e.g., image captioning, conversation modeling, etc.) it's crucial to prevent UNK from being generated at inference time. For greedy decoding, I could fix this by modifying the sample() function of the GreedyEmbeddingHelper class to take the top-2 indices (instead of the argmax) and choose the one that's not UNK. However, the same idea doesn't work for beam search. I'm modifying the choose_top_k() function in inference.beam_search as follows:

def choose_top_k(scores_flat, config):
  K = config.beam_width
  UNK_ID = int(config.vocab_size) - 3
  # Instead of top-K, retrieve top-K+1 in case the UNK token appears in top K.
  next_beam_scores, word_indices = tf.nn.top_k(scores_flat, k=K+1)

  # Get a Boolean condition tensor which can indicate the UNK index
  condition = tf.not_equal(word_indices[:K], [UNK_ID]) 

  # For any possible UNK token, copy the value of word_indices[K] instead.
  selected_indices = tf.where(condition, word_indices[:K], [word_indices[K]]*(K) )
  selected_scores = tf.where(condition, next_beam_scores[:K], [next_beam_scores[K]]*(K) )
  return selected_scores, selected_indices
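
(For reference, the greedy-decoding tweak mentioned above could look roughly like the sketch below. This is only an illustration, assuming the tf.contrib.seq2seq GreedyEmbeddingHelper; the UNK_ID value is a placeholder, not the repo's actual constant.)

import tensorflow as tf
from tensorflow.contrib.seq2seq import GreedyEmbeddingHelper

UNK_ID = 2  # illustrative; in this repo the UNK id is vocab_size - 3

class NoUnkGreedyEmbeddingHelper(GreedyEmbeddingHelper):
  """Greedy helper that never emits UNK: falls back to the runner-up token."""

  def sample(self, time, outputs, state, name=None):
    # outputs: [batch_size, vocab_size] logits at the current decoding step.
    _, top2 = tf.nn.top_k(outputs, k=2)   # [batch_size, 2]
    best, second = top2[:, 0], top2[:, 1]
    # Wherever the argmax is UNK, take the second-best token instead.
    sample_ids = tf.where(tf.equal(best, UNK_ID), second, best)
    return tf.cast(sample_ids, tf.int32)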

When I decode using the updated beam search, the UNK token still appears in the predictions. Is there anywhere else in the beam search inference that might re-introduce the UNK token? My inference script is simply the following:

  python -m bin.infer \
  --tasks "
    - class: DecodeText
    - class: DumpBeams
      params:
        file: ${PRED_DIR}/beams.npz" \
  --model_dir $MODEL_DIR \
  --model_params "
    inference.beam_search.beam_width: 2"\
  --input_pipeline "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES" \
  > ${PRED_DIR}/predictions.txt

Thanks

Scitator commented 7 years ago

Look here

nasrinm commented 7 years ago

Thanks for the link; however, I had seen this UNK replacement before, and it's not a solution for applications such as image captioning or conversation modeling. What I need is not UNK 'replacement' but 'prevention' of UNK generation at inference time.

vijaydaultani commented 7 years ago

@nasrinm If I understand correctly, you want the unk_replace option for beam search? unk_replace, i.e., UNK token replacement using the copy mechanism, is a parameter of the DecodeText task during inference. The current implementation uses the attention mechanism to generate the attention_scores over the encoded input. But to my understanding it currently only works for a beam width of 1 (basically without beam search).
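
For context, enabling it looks roughly like this (a sketch based on the command in the original question, with the unk_replace param added and the DumpBeams task dropped):

  python -m bin.infer \
  --tasks "
    - class: DecodeText
      params:
        unk_replace: True" \
  --model_dir $MODEL_DIR \
  --input_pipeline "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES" \
  > ${PRED_DIR}/predictions.txt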

I am also interested in using unk_replace with the copy mechanism for beam search (beam_width > 1), although I don't think it's possible with the current implementation. Why? Read these lines from here:

The array file can be loaded using numpy and will contain a list of arrays with shape [target_length, source_length]

To verify, look at the DumpAttention task in dump_attention.py, at the function _get_scores (which dumps the numpy attention scores): the output is a list of arrays of shape [target_length, source_length]. So the code right now generates attention scores for only one target (not for multiple targets, as in beam search). If we want attention scores for the multiple target hypotheses produced by beam search, we would instead need a 3-dimensional numpy array, something like [target_num, target_length, source_length]. In fact, Issue #174 was opened because of this same limitation: the current implementation does not generate attention scores for multiple beams.
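
As a tiny numpy illustration of that shape gap (made-up sizes, not code from the repo):

import numpy as np

target_length, source_length, num_beams = 7, 12, 4

# What is dumped today: one attention matrix for the single decoded target.
single_target_scores = np.random.rand(target_length, source_length)

# What per-beam UNK replacement would need: one matrix per hypothesis,
# i.e. something like [target_num, target_length, source_length].
per_beam_scores = np.random.rand(num_beams, target_length, source_length)

# Copy-mechanism lookup for beam b at decoding step t: the source position
# with the highest attention weight.
b, t = 0, 3
print("beam", b, "step", t, "copies source position", int(per_beam_scores[b, t].argmax()))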

Anyway, I would appreciate it if someone in the community could suggest a good way to approach this functionality.

liyi193328 commented 7 years ago

@nasrinm I think your idea and code are basically right. But how did you end up solving the problem?

liyi193328 commented 7 years ago

@nasrinm I found a bug. scores_flat is flattened from [batch_size, vocab_size], so the UNK index appears at multiple positions: vocab_size - 3, 2 * vocab_size - 3, ..., batch_size * vocab_size - 3. They must all be masked. A simple fix is to mask out the UNK entries of the [batch_size, vocab_size] scores (set them to a value top_k will never pick), so they can't be chosen.
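
A minimal sketch of that fix, reusing the choose_top_k signature from the question; it assumes scores_flat is the flattened [batch_size, vocab_size] matrix and that the scores are log-probabilities, so "masking" here means pushing the UNK entries to the smallest representable value:

import tensorflow as tf

def choose_top_k_no_unk(scores_flat, config):
  K = config.beam_width
  vocab_size = int(config.vocab_size)
  unk_id = vocab_size - 3
  # Every position whose index modulo vocab_size equals unk_id is the UNK
  # entry of some row in the flattened [batch_size, vocab_size] scores.
  positions = tf.range(tf.shape(scores_flat)[0])
  is_unk = tf.equal(positions % vocab_size, unk_id)
  # Push all UNK entries to the smallest representable score so that
  # tf.nn.top_k can never select them.
  masked = tf.where(is_unk,
                    tf.ones_like(scores_flat) * scores_flat.dtype.min,
                    scores_flat)
  return tf.nn.top_k(masked, k=K)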

This is one simple way to handle the UNK problem. Another is to use the attention scores to replace UNK during beam search, though that would require changing a bit of code and structure. Happy to discuss. Thanks.