victorherbemontagne closed this issue 6 years ago
Thanks Victor! :) You're right: we first centre the activations to zero mean, then scale them to unit variance; we'll update the paper accordingly.

In terms of learning, though, the two formulations are equivalent, since `s * (x + b) = s * x + (s * b) = s * x + b'`, where `b' = s * b` is the new bias. Note that the `sum(log(|s|))` term in the loss prevents the values in `s` from becoming 0.
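For concreteness, here is a minimal NumPy sketch (not the repository code) checking that the two parameterizations give identical outputs once the bias is reparameterized:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # toy activations
s = rng.normal(size=(1, 8))   # per-channel scale
b = rng.normal(size=(1, 8))   # per-channel bias

# "centre then scale", as in the code: s * (x + b)
y_code = s * (x + b)

# "scale then shift", as in the paper: s * x + b', with b' = s * b
y_paper = s * x + s * b

assert np.allclose(y_code, y_paper)  # same output, only the bias is reparameterized
```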
Hi,
First of all, thanks for this amazing work, it has been a pleasure to dive into the paper!

More precisely, when looking at the implementation of the actnorm module, I can't understand the choice made given the paper. In the paper you state that you use an affine transformation of the activations with parameters `s` and `b`.
But in the implementation it seems to first add the bias b: `x = x + b` (with `actnorm_center`), then multiply by s: `x = s * (x + b)` (with `actnorm_scale`). You reverse these operations when `reverse = True`, but I feel the order might be the opposite of what the paper describes (see the sketch below).
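Just to make sure we are talking about the same thing, here is a minimal sketch of the ordering as I read it (simplified, not the actual glow code):

```python
def actnorm(x, s, b, reverse=False):
    # Forward: centre first, then scale -> s * (x + b)
    if not reverse:
        x = x + b   # actnorm_center
        x = x * s   # actnorm_scale
    # Reverse: undo the scale first, then the bias
    else:
        x = x / s
        x = x - b
    return x
```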
I'm surely missing something, since you do manage to train the model, but I'm curious about this choice. Am I missing something?
Thank you in advance for your help!
Victor