Closed · sunprinceS closed this issue 4 years ago
As some papers mention, Transformers need so-called `warmup_steps`. During the inner loop, should we keep the same lr throughout, or change it along with the outer-loop lr? In addition, which optimizer should we choose?
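For context, the warmup schedule usually meant here is the one from "Attention Is All You Need" (the so-called Noam schedule); a minimal sketch, assuming `d_model` and `warmup_steps` as in that paper:

```python
def noam_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """Transformer ("Noam") schedule: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid step ** -0.5 blowing up at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peaks at step == warmup_steps with lr = d_model**-0.5 * warmup_steps**-0.5,
# e.g. ~7e-4 for d_model=512, warmup_steps=4000.
print(noam_lr(4000, d_model=512, warmup_steps=4000))
```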
Current implementation: **SGD with lr = d_model^-0.5 · warmup_steps^-0.5** (the max lr of the outer-loop learning-rate schedule).
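For concreteness, here is a minimal sketch of how that setup could look in PyTorch. Everything here is an assumption for illustration (the dummy `Linear` model, Adam as the meta-optimizer, the hyper-parameter values, and the Reptile-style first-order meta-update); the only part taken from the comment above is that the inner loop runs plain SGD at a fixed lr equal to the peak of the outer-loop warmup schedule, while the outer loop follows the schedule itself.

```python
import copy
import torch

d_model, warmup_steps = 512, 4000                        # assumed hyper-parameters
peak_lr = d_model ** -0.5 * warmup_steps ** -0.5         # max lr of the outer-loop schedule

model = torch.nn.Linear(d_model, d_model)                # stand-in for the Transformer
meta_opt = torch.optim.Adam(model.parameters(), lr=peak_lr)
meta_sched = torch.optim.lr_scheduler.LambdaLR(          # outer loop: Noam-style warmup
    meta_opt,
    lambda s: warmup_steps ** 0.5 * min((s + 1) ** -0.5, (s + 1) * warmup_steps ** -1.5),
)

for outer_step in range(3):                              # outer (meta) loop
    learner = copy.deepcopy(model)                       # task-specific copy (first-order style)
    inner_opt = torch.optim.SGD(learner.parameters(), lr=peak_lr)  # fixed lr, no schedule

    for _ in range(5):                                   # inner (adaptation) loop
        x = torch.randn(8, d_model)                      # dummy support batch
        loss = learner(x).pow(2).mean()
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Reptile-style first-order meta-update: move the meta-parameters toward the
    # adapted ones (a stand-in for whatever meta-gradient the repo actually uses).
    meta_opt.zero_grad()
    for p, q in zip(model.parameters(), learner.parameters()):
        p.grad = p.data - q.data
    meta_opt.step()
    meta_sched.step()                                    # advance the outer-loop warmup schedule
```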
Reasons: