mikeymezher opened this issue 6 years ago
It's also important to note that proximal ids tend to be similar terms (e.g. 35000 ~= 35001), which is why I know the above "top 5 ids indices" (which equate to the top 5 terms during decoding) are reasonable: they relate to their targets during training.
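For anyone who wants to reproduce that check, here is a minimal sketch that decodes a pair of neighboring ids back to subwords. The vocab path is a placeholder, not the actual file from this setup; it only assumes the problem uses a SubwordTextEncoder vocab.

```python
# Sanity check (hypothetical vocab path): decode adjacent ids and confirm
# they map to similar subword strings, as noted above.
from tensor2tensor.data_generators import text_encoder

encoder = text_encoder.SubwordTextEncoder("data_dir/vocab.translate.32768.subwords")

for token_id in (35000, 35001):
    # decode() takes a list of ids and returns the corresponding text.
    print(token_id, repr(encoder.decode([token_id])))
```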
I also experience decoding issues with models trained since T2T 1.10.0. The decoder repeats the first word over and over again until hitting max_length. "How are you", for example, is translated as "Wie Wie Wie Wie Wie Wie Wie Wie Wie Wie Wie Wie Wie Wie ..." Loss and evaluation stats during training look fine.
@mehmedes Out of curiosity, is your problem type Text2Text (full encoder + decoder stacks)?
I suppose so. I run a translate_wmt problem.
Description
Terms repeat themselves during decoding. Training and evaluation proceed as normal. This doesn't appear to be an issue with decoding itself, as models trained in the past (pre-T2T 1.10.0) perform fine.
This is occurring on a Text2Self-type problem (defined in text_problems.py). I'm not quite sure where this problem stems from. I've gone as far back as analyzing the logits produced in smoothing_cross_entropy (common_layers), and they appear fine.
For example:

```
LABELS AT SOFTMAX CROSS ENTROPY:
[[[[33481]] [[1089]] [[33480]] [[33425]] [[33423]] [[3317]] [[33424]] [[33418]] [[33416]] [[3311]]] ...]
TOP 5 IDS INDICIES:
[[[[[10943 30030 10905 17983 31291]]] [[[1087 2167 1085 1093 1079]]] [[[33426 33484 33480 33481 33428]]] [[[33481 33425 33412 33469 33482]]] [[[33423 33414 33412 33426 33480]]] [[[4435 4431 3319 4433 3321]]] [[[33412 33424 33416 1 33417]]] [[[33423 33418 3321 3319 3317]]] [[[33416 33424 33421 33417 33423]]] [[[3319 3317 33394 4393 4391]]]] ...]
TOP 5 IDS VALUES:
[[[[[6.24065685 5.62128639 5.55350065 5.52335882 5.27032423]]] [[[9.33180618 9.04827118 8.96213 8.10716248 8.00762272]]]] ...]
```
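For reference, a minimal sketch of the kind of instrumentation that could produce a log like the one above, assuming access to the logits and labels inside smoothing_cross_entropy. The helper name and the summarize value are illustrative, not T2T internals.

```python
# Illustrative debug helper (not T2T code): print the labels alongside the
# top-5 predicted ids and their logit values for each target position.
import tensorflow as tf

def log_top5(logits, labels):
    # logits: [..., vocab_size]; labels: matching leading dimensions.
    top_values, top_indices = tf.nn.top_k(logits, k=5)
    return tf.Print(
        logits,
        [labels, top_indices, top_values],
        message="LABELS / TOP 5 IDS INDICES / TOP 5 IDS VALUES: ",
        summarize=50)
```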
The first index is consistently wrong (the model has no previous context at that point) and the predictions become progressively more accurate from there, which seems correct. Evaluation also yields fine results.
During decoding, however, everything breaks: the restored weights lead to somewhat reasonable terms for the second-highest logit, but the highest logit is consistently the most recent input term itself. I suspected the inputs weren't being shifted to the right during training, but this doesn't appear to be the case.
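For context, the right-shift I checked is essentially what common_layers.shift_right_3d does in T2T; a self-contained equivalent is sketched below (the [batch, length, depth] shape is assumed).

```python
# Standalone equivalent of the target right-shift checked above: prepend a
# zero timestep and drop the last one, so position t only sees targets < t.
import tensorflow as tf

def shift_right_3d(x):
    # x: [batch, length, depth]
    return tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
```

If this shift were missing at training time, the model could trivially copy the current target instead of predicting the next one, which is why it was the first thing I ruled out.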
...
Environment information
T2T 1.10.0, TF 1.11.0