"We conclude by suggesting that practitioners stick to linear warmup
with Adam, with a sensible default being linear warmup over
2·(1−β_2)^−1 training iterations."
We use the default TensorFlow Adam hyperparameters, where β_2 = 0.999,
so the warmup length is 2·(1−0.999)^−1 = 2000 training iterations.
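
For reference, the rule of thumb can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the code changed in this PR. The function names `warmup_steps` and `linear_warmup_lr` are hypothetical, and counting steps from 0 is an assumption.

```python
def warmup_steps(beta2: float = 0.999) -> int:
    """Warmup length 2 * (1 - beta2)^-1, the default suggested by Ma & Yarats (2019)."""
    # round() guards against floating-point error in (1 - beta2).
    return round(2.0 / (1.0 - beta2))


def linear_warmup_lr(step: int, base_lr: float, beta2: float = 0.999) -> float:
    """Scale base_lr linearly from 0 to base_lr over the warmup period.

    Assumption: `step` counts from 0. After the warmup period the
    learning rate stays at base_lr (no decay is modelled here).
    """
    n = warmup_steps(beta2)
    return base_lr * min(1.0, step / n)


if __name__ == "__main__":
    print(warmup_steps())  # 2000 for beta2 = 0.999
    for step in (0, 500, 1000, 2000, 5000):
        print(step, linear_warmup_lr(step, base_lr=1e-3))
```

With β_2 = 0.999 the learning rate reaches its full value at step 2000 and is held constant afterwards in this sketch.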
cc @DiveFish: I saw that you were training a sticker model the other day. You probably want to use the new default. The improvements for Dutch and German are about 0.6% and 0.5% LAS, respectively.
Motivation: Ma & Yarats, 2019 (quoted above).
Fixes #169.