martinpopel opened this issue 6 years ago
Sentences longer than the parameter `max_length` are excluded from training, and lowering this parameter helps to prevent OOM errors and allows a higher `batch_size`, so it is quite useful. Unfortunately, setting this parameter too low results in low BLEU and retarded learning curves. The graph below shows curves (evaluated on the dev set) for `max_length` 25, 50, 70, 150, 200 and 400:

*(graph: dev-set BLEU learning curves for the six `max_length` values)*

There are two possible explanations, but I think both of them are false:
- `max_length` too low makes the training data smaller. However, with `max_length=70` only 2.1% of my training sentences are excluded. Moreover, the "70" BLEU curve is decreasing after the first hour of training, while processing the whole training data (one epoch) takes more than two days of training.
When I increased the `batch_size` from 1500 to 2000, the results improved: the "25" and "50" curves were still retarded, but "70" and higher achieved the same result as when training without any `max_length` restriction.

Can someone explain this? Or even fix it if it is a bug?
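For context: in tensor2tensor, the Transformer's `batch_size` counts subwords per padded batch, not sentences. Below is a minimal pure-Python sketch of the filter-then-pack mechanics described above; `make_batches` is a hypothetical helper, and the global sort is only a crude stand-in for t2t's length bucketing, so this mimics the real pipeline loosely.

```python
def make_batches(tokenized_sentences, max_length, batch_size):
    """Group subword-id lists into padded batches of at most batch_size subwords.

    Assumptions (not t2t's actual code): batches are padded to their longest
    member, and batch_size is measured in subwords per padded batch.
    """
    # The max_length filter: sentences longer than max_length are dropped.
    kept = [s for s in tokenized_sentences if len(s) <= max_length]
    kept.sort(key=len)  # crude stand-in for length bucketing

    batches, current, longest = [], [], 0
    for sent in kept:
        new_longest = max(longest, len(sent))
        # Padded size of the batch if this sentence were added.
        if current and (len(current) + 1) * new_longest > batch_size:
            batches.append(current)
            current, new_longest = [], len(sent)
        current.append(sent)
        longest = new_longest
    if current:
        batches.append(current)
    return batches

if __name__ == "__main__":
    import random
    random.seed(0)
    data = [[0] * random.randint(5, 120) for _ in range(1000)]
    for ml in (25, 70, 400):
        print(ml, len(make_batches(data, max_length=ml, batch_size=2000)), "batches")
```

Under these assumptions, a batch of length-70 sentences at `batch_size=2000` holds about 28 sentences, while length-25 sentences pack about 80 per batch; longer sequences also cost more activation memory per subword (attention is quadratic in length), which is presumably why a lower `max_length` helps against OOM.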
@martinpopel are these numbers from tensor2tensor 1.2.9 or from a more recent version? (I ask this in relation to bug #529, as 1.2.9 is the version some of us are working in.)

@noe: Yes, these numbers (the graph) are with 1.2.9.

@martinpopel How did you find out how many subwords your sentences have?

@mehmedes: using this ad-hoc script.
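The ad-hoc script itself is only linked, not shown in the thread; a minimal sketch of how such a count might be done with tensor2tensor's `SubwordTextEncoder` (both file paths below are assumptions):

```python
# Hypothetical sketch, not martinpopel's actual script: print the number
# of subwords in each training sentence.
from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder

# Both paths are assumptions; substitute your own vocab and corpus files.
encoder = SubwordTextEncoder("vocab.translate_ende_wmt32k.32768.subwords")
with open("train.en") as corpus:
    for line in corpus:
        print(len(encoder.encode(line.rstrip("\n"))))
```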