Open YoucanBaby opened 1 month ago
Hello, I am also curious about it. Have you got any clue about this?
Sorry, no clues so far.
This is probably because there is a torch.nn.LayerNorm immediately before the to_k_ip linear layer. LayerNorm normalizes the input of the linear layer to zero mean. Let the input of the linear layer be x = [x_1, x_2, ..., x_N] and the output be y = [y_1, y_2, ..., y_M], so y_1 = x_1 w_11 + x_2 w_21 + ... + x_N w_N1, and LayerNorm guarantees mean(x) = 0. During backpropagation, let D be the gradient flowing into y_1 from the layers after it. The gradient of each weight w_i1 is x_i * D, so the total update to w_11 through w_N1 is (x_1 + x_2 + ... + x_N) * D = N * mean(x) * D = 0. In other words, each column sum of the to_k_ip weight matrix is invariant during training and stays equal to its value at initialization, and I guess the weights were initialized so that this sum is zero.
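A minimal NumPy sketch of this argument (not the repo's actual code; the zero-mean input stands in for a LayerNorm output, and scale/shift are ignored for simplicity): one SGD step on a linear layer leaves the column sums of the weight matrix unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 4  # input and output dimensions (illustrative sizes)

# Stand-in for a LayerNorm output: an input with exactly zero mean
x = rng.normal(size=N)
x = x - x.mean()  # mean(x) == 0, as LayerNorm guarantees

W = rng.normal(size=(N, M))  # linear layer weights, y = x @ W
col_sums_before = W.sum(axis=0).copy()

# One SGD step with an arbitrary upstream gradient D = dL/dy
D = rng.normal(size=M)
grad_W = np.outer(x, D)  # dL/dW[i, j] = x_i * D_j
W -= 0.1 * grad_W

# Column sum of the update: sum_i x_i * D_j = (sum_i x_i) * D_j = 0,
# so each column sum of W is unchanged by the step
assert np.allclose(W.sum(axis=0), col_sums_before)
print("column sums preserved")
```

So if a column of to_k_ip sums to zero at initialization, it sums to zero after any number of such updates.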
Dear author,
Thanks a lot for your great project.
Why is the mean of model weights all 0? Did you use any training tricks?
Best regards.