After experimenting with an NMT-related TensorFlow project - https://github.com/ssampang/im2latex - I realized that my greedy decoder using GreedyEmbeddingHelper performed better than my BeamSearchDecoder.
According to this paper, this is probably due to my training strategy, which doesn't account for exposure bias: the model is never exposed to its own errors during training. Indeed, I am using the seq2seq TrainingHelper, which only trains one step ahead and always feeds the gold-standard token to predict the next output.
Instead, the paper claims significant improvements from a beam-search-optimized training strategy. Do you know whether using the ScheduledEmbeddingTrainingHelper would suffice to address this exposure bias? It looks very close to what is done in the paper, except that only one trajectory would be considered here.
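For reference, here is a rough sketch of what I have in mind, assuming TF 1.x and tf.contrib.seq2seq; the shapes and names (embedding_matrix, decoder_cell, output_layer, the decay schedule, etc.) are placeholders of mine, not taken from the repo:

```python
import tensorflow as tf

# Minimal sketch, not the repo's actual graph: all sizes/names are placeholders.
batch_size, max_len, vocab_size, embed_dim, hidden_dim = 32, 50, 500, 80, 256

embedding_matrix = tf.get_variable("embedding", [vocab_size, embed_dim])
token_ids = tf.placeholder(tf.int32, [batch_size, max_len])   # gold target ids
decoder_lengths = tf.placeholder(tf.int32, [batch_size])
decoder_inputs = tf.nn.embedding_lookup(embedding_matrix, token_ids)

decoder_cell = tf.nn.rnn_cell.LSTMCell(hidden_dim)
output_layer = tf.layers.Dense(vocab_size)

# Anneal sampling_probability from 0 (pure teacher forcing) toward 0.5,
# so the decoder increasingly sees its own sampled tokens during training.
global_step = tf.train.get_or_create_global_step()
sampling_probability = tf.train.polynomial_decay(
    0.0, global_step, decay_steps=100000, end_learning_rate=0.5)

helper = tf.contrib.seq2seq.ScheduledEmbeddingTrainingHelper(
    inputs=decoder_inputs,
    sequence_length=decoder_lengths,
    embedding=embedding_matrix,            # used to embed sampled token ids
    sampling_probability=sampling_probability)

decoder = tf.contrib.seq2seq.BasicDecoder(
    cell=decoder_cell,
    helper=helper,
    initial_state=decoder_cell.zero_state(batch_size, tf.float32),
    output_layer=output_layer)

outputs, final_state, final_lengths = tf.contrib.seq2seq.dynamic_decode(decoder)
logits = outputs.rnn_output                # fed into the usual sequence loss
```

As I understand it, this only explores a single sampled trajectory per step rather than a beam, which is why I'm unsure it matches the paper's approach.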