tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial

[Question] Data preparation: sos eos tokens addition rule #245

Open gloriouskilka opened 6 years ago

gloriouskilka commented 6 years ago

Hello!

In `nmt/utils/iterator_utils.py`: `# Create a tgt_input prefixed with <sos> and a tgt_output suffixed with <eos>.`
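For context, here is a minimal sketch of what that comment describes: the target sentence is turned into a shifted pair, where the decoder consumes `tgt_input` and its prediction at each step is scored against `tgt_output`. Names like `make_decoder_pair` are mine for illustration, not the repo's exact code:

```python
import tensorflow as tf

def make_decoder_pair(tgt, tgt_sos_id, tgt_eos_id):
    """Illustrative only: build the shifted decoder input/output pair.

    tgt        = [w1, w2, w3]
    tgt_input  = [<sos>, w1, w2, w3]  # fed to the decoder
    tgt_output = [w1, w2, w3, <eos>]  # compared against the logits
    """
    tgt_input = tf.concat(([tgt_sos_id], tgt), axis=0)
    tgt_output = tf.concat((tgt, [tgt_eos_id]), axis=0)
    return tgt_input, tgt_output
```

This is the standard teacher-forcing shift: at step t the decoder sees `tgt_input[t]` and the loss compares its prediction to `tgt_output[t]`.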

Why do we do this? Why split the desired output like this?

As I understand it, we want the decoded output sequence to start with <sos> and end with <eos>, yet there is only one case where decoding stops on <eos>: when we use BeamSearchDecoder (does it stop right before this token?), which we don't use during training.

Also: during training I never see an <eos> in the generated sequences, but <sos> tokens show up after roughly every sentence. I use BeamSearchDecoder during inference, and it never stops on <eos> tokens, so as a quick fix I made it stop on <sos> tokens instead, which is weird.
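For reference, the TF 1.x contrib decoder takes the stop symbol explicitly via `end_token`, so where inference halts depends entirely on which id is passed there. A minimal sketch, assuming `decoder_cell`, `embedding_decoder`, `decoder_initial_state`, `batch_size`, `beam_width`, and the `sos_id`/`eos_id` lookups are already built (hypothetical names, not the repo's exact variables):

```python
import tensorflow as tf

# Sketch only: inference-time beam search in TF 1.x contrib.
# A beam is marked finished once it emits `end_token`, so passing the
# wrong id here (e.g. sos instead of eos) changes where decoding stops.
decoder = tf.contrib.seq2seq.BeamSearchDecoder(
    cell=decoder_cell,
    embedding=embedding_decoder,
    start_tokens=tf.fill([batch_size], sos_id),
    end_token=eos_id,
    initial_state=decoder_initial_state,
    beam_width=beam_width)
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, maximum_iterations=maximum_iterations)
```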