torch / optim

A numeric optimization package for Torch.

Slight modification of adam.lua causing different training losses with the same seed #158

Open szhengac opened 7 years ago

szhengac commented 7 years ago

I just came across a strange problem. I slightly modified some parts of adam.lua as follows:

   -- Initialization
   state.t = state.t or 0
   -- Exponential moving average of gradient values
   state.m = state.m or x.new(x:size()):zero()
   -- Exponential moving average of squared gradient values
   state.v = state.v or x.new(x:size()):zero()
   -- A tmp tensor to hold the sqrt(v) + epsilon
   state.denom = state.denom or x.new(x:size()):zero()

   -- (3) learning rate decay (annealing)
   local clr = lr / (1 + state.t*lrd)

   state.t = state.t + 1
   local biasCorrection1 = 1 - beta1^state.t 
   local biasCorrection2 = 1 - beta2^state.t 

   -- (1) evaluate f(x) and df/dx
   local fx, dfdx = opfunc(x)

   -- (2) weight decay
   if wd ~= 0 then
      dfdx:add(wd, x)
   end

I changed the order of (1), (2) and (3), and placed

   local biasCorrection1 = 1 - beta1^state.t
   local biasCorrection2 = 1 - beta2^state.t

after state.t = state.t + 1. With these changes, the training losses are no longer reproducible across runs, even though I used the same seed. If I add a print() between state.t = state.t + 1 and local biasCorrection1 = 1 - beta1^state.t, then I get identical training losses across multiple runs. The original adam.lua produces identical results across multiple runs.
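
For reference, the step order in the stock adam.lua is roughly the sketch below. I am reconstructing it from memory around the numbered comments above, so the wrapper name adamOriginalOrder is mine and minor details (default values, tensor sizing) may not match the repository file exactly:

   -- Sketch of the original step order, for comparison only; reconstructed
   -- from memory, so minor details may differ from the actual adam.lua.
   local function adamOriginalOrder(opfunc, x, config, state)
      local config = config or {}
      local state = state or config
      local lr = config.learningRate or 0.001
      local lrd = config.learningRateDecay or 0
      local beta1 = config.beta1 or 0.9
      local beta2 = config.beta2 or 0.999
      local epsilon = config.epsilon or 1e-8
      local wd = config.weightDecay or 0

      -- (1) evaluate f(x) and df/dx
      local fx, dfdx = opfunc(x)

      -- (2) weight decay
      if wd ~= 0 then
         dfdx:add(wd, x)
      end

      -- Initialization
      state.t = state.t or 0
      state.m = state.m or x.new(dfdx:size()):zero()
      state.v = state.v or x.new(dfdx:size()):zero()
      state.denom = state.denom or x.new(dfdx:size()):zero()

      -- (3) learning rate decay (annealing)
      local clr = lr / (1 + state.t*lrd)

      state.t = state.t + 1

      -- Update the biased first and second moment estimates
      state.m:mul(beta1):add(1 - beta1, dfdx)
      state.v:mul(beta2):addcmul(1 - beta2, dfdx, dfdx)
      state.denom:copy(state.v):sqrt():add(epsilon)

      -- Bias corrections come after the moment updates here
      local biasCorrection1 = 1 - beta1^state.t
      local biasCorrection2 = 1 - beta2^state.t
      local stepSize = clr * math.sqrt(biasCorrection2) / biasCorrection1

      -- (4) update x
      x:addcdiv(-stepSize, state.m, state.denom)

      -- return x*, f(x) before optimization
      return x, {fx}
   end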

Does anyone have any idea what might be happening?
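
For context, the kind of same-seed check I mean looks roughly like this minimal sketch; the nn.Linear model, the random data, and the learning rate are placeholder choices, not the actual training setup:

   -- Minimal same-seed reproducibility check (sketch only; the tiny linear
   -- model and random data below are stand-ins for the real training code).
   require 'torch'
   require 'nn'
   require 'optim'

   local function run(seed)
      torch.manualSeed(seed)
      local model = nn.Linear(10, 1)
      local criterion = nn.MSECriterion()
      local params, gradParams = model:getParameters()
      local state = {}
      local losses = {}
      for i = 1, 5 do
         local input = torch.randn(16, 10)
         local target = torch.randn(16, 1)
         local feval = function(x)
            if params ~= x then params:copy(x) end
            gradParams:zero()
            local output = model:forward(input)
            local loss = criterion:forward(output, target)
            model:backward(input, criterion:backward(output, target))
            return loss, gradParams
         end
         local _, fx = optim.adam(feval, params, {learningRate = 1e-3}, state)
         losses[i] = fx[1]
      end
      return losses
   end

   -- Two runs with the same seed should produce identical loss sequences.
   local a, b = run(1234), run(1234)
   for i = 1, #a do
      print(i, a[i], b[i], a[i] == b[i])
   end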