Problem Summary

In the following example, my NMT model has high loss because it correctly predicts target_input instead of target_output.

As is evident, the prediction matches target_input almost exactly instead of target_output; it is simply off by one. This happens because the target language is so repetitive: most of the "words" repeat many times in a row, so the model can minimize loss just by echoing the last word, and it is only wrong at the timesteps where the word changes. TrainingHelper enables this bad behavior, since it always feeds the model the correct output from the previous timestep.
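To make the failure mode concrete, here is a toy illustration (made-up data, not my actual model output): a "predictor" that just repeats the previous target token is already correct at almost every timestep, so there is very little pressure on the model to do anything smarter.

```python
# Toy repetitive target sequence: long runs of identical tokens.
target = ["A"] * 20 + ["B"] * 15 + ["C"] * 25

# "Prediction" at step t = target token at step t-1, i.e. exactly what
# TrainingHelper hands the decoder as input at step t.
copy_previous = [target[0]] + target[:-1]

correct = sum(p == t for p, t in zip(copy_previous, target))
print(f"copy-previous accuracy: {correct / len(target):.1%}")
# -> 96.7%: wrong only at the two timesteps where the word changes.
```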
I have thought of three main solutions to this problem, each with cons:

1. Use an inference-style decoder instead of TrainingHelper, so the model can no longer coast by repeating the token it was just fed. However, this will greatly slow down training and may introduce unforeseen errors (do buckets work with inference-style decoders?). See the first sketch after this list.
2. Reweight the loss function to emphasize the timesteps where the sequence changes. This will hopefully push the model out of the plateau and improve its behavior, but it will also skew the model's training overall. See the second sketch after this list.
3. Reconfigure the target language so that the sequence A A A A A B B B becomes A5 A4 A3 A2 A1 B3 B2 B1, to incentivize the model to predict changes in the sequence. However, this will greatly increase the vocabulary size and thus training time. See the third sketch after this list.
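For option 1, this is roughly what I have in mind, as an untested sketch against tf.contrib.seq2seq (TF 1.x). The toy sizes, the zero initial state, and names like sos_id / eos_id are stand-ins for my actual graph, not working training code:

```python
import tensorflow as tf  # TF 1.x, where tf.contrib.seq2seq lives

# Stand-in sizes and ids; in my model these come from the real graph.
batch_size, vocab_size, emb_dim, num_units, max_target_length = 32, 100, 64, 128, 50
sos_id, eos_id = 1, 2

embedding_matrix = tf.get_variable("embedding", [vocab_size, emb_dim])
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
encoder_state = decoder_cell.zero_state(batch_size, tf.float32)  # stand-in for the encoder's final state
output_layer = tf.layers.Dense(vocab_size)

# Inference-style helper: the decoder embeds and feeds back its own argmax prediction,
# so "repeat whatever I was just fed" stops being a free ride the way it is with TrainingHelper.
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding=embedding_matrix,
    start_tokens=tf.fill([batch_size], sos_id),
    end_token=eos_id)

decoder = tf.contrib.seq2seq.BasicDecoder(
    cell=decoder_cell, helper=helper,
    initial_state=encoder_state, output_layer=output_layer)

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, maximum_iterations=max_target_length)
logits = outputs.rnn_output  # still scored against target_output, as before
```

One thing I would have to handle is that the decoded length no longer automatically matches the target length, so the logits may need padding or cropping before the loss.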
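For option 2, the reweighting could look something like the following sketch; change_weight is a made-up hyperparameter, and the placeholders stand in for tensors my graph already produces:

```python
import tensorflow as tf  # TF 1.x

# Placeholders standing in for my real graph.
target_output = tf.placeholder(tf.int32, [None, None])    # [batch, time]
target_lengths = tf.placeholder(tf.int32, [None])          # [batch]
logits = tf.placeholder(tf.float32, [None, None, 100])     # [batch, time, vocab]
change_weight = 10.0  # made-up: how much extra weight a change point gets

# A timestep is a "change point" if its token differs from the previous target token.
prev_tokens = tf.concat([target_output[:, :1], target_output[:, :-1]], axis=1)
is_change = tf.cast(tf.not_equal(target_output, prev_tokens), tf.float32)

# Usual padding mask, with the change points upweighted.
mask = tf.sequence_mask(target_lengths,
                        maxlen=tf.shape(target_output)[1],
                        dtype=tf.float32)
weights = mask * (1.0 + (change_weight - 1.0) * is_change)

loss = tf.contrib.seq2seq.sequence_loss(logits, target_output, weights)
```

The skew I am worried about is that the model then cares proportionally less about getting the long runs right.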
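And for option 3, the re-encoding itself is simple; this plain-Python snippet just shows the transformation I mean:

```python
from itertools import groupby

def countdown_encode(tokens):
    """Rewrite each run of identical tokens as a countdown: A A A -> A3 A2 A1."""
    out = []
    for word, run in groupby(tokens):
        length = len(list(run))
        out.extend(f"{word}{n}" for n in range(length, 0, -1))
    return out

print(countdown_encode(["A", "A", "A", "A", "A", "B", "B", "B"]))
# -> ['A5', 'A4', 'A3', 'A2', 'A1', 'B3', 'B2', 'B1']
```

The vocabulary then grows from one entry per word to one entry per (word, remaining-run-length) pair, which is where the extra training time comes from.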
Are any of these viable? What other alternatives are there to train a model with a repetitive target language?