Closed zhuchen03 closed 1 year ago
I think the -1/4 exponent is correct. In the calculation for Theorem 3.1, the weight norm of each FC layer (say w1 and w2) appears squared, so a norm of (9N)^{-1/4} contributes (9N)^{-1/2} per layer. The product |w1|^2 * |w2|^2 then gives the desired -1 in the exponent, i.e. (9N)^{-1}.
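To make the exponent bookkeeping concrete, here is a quick numeric check. It is only a sketch of the arithmetic above: the value of `N` is an arbitrary placeholder chosen for illustration, and the variable names are hypothetical, not taken from the repository.

```python
import math

N = 12  # hypothetical value, chosen only for illustration

scale = (9 * N) ** (-1 / 4)  # norm assigned to each FC layer
per_layer = scale ** 2       # each squared norm: (9N)^(-1/2)
product = per_layer ** 2     # |w1|^2 * |w2|^2: (9N)^(-1)

# Both layers share the same norm, so the product of the two
# squared norms is simply per_layer * per_layer.
assert math.isclose(per_layer, (9 * N) ** (-1 / 2))
assert math.isclose(product, 1 / (9 * N))
```

If each layer instead used (9N)^{-1/2}, the same product would come out as (9N)^{-2}, which is why the per-layer exponent is halved.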
Hi, thanks for sharing the code.
I am wondering why the norm of each of the two FC layers in the MLP blocks is set to (9N)^{-1/4}. I would expect something like (9N)^{-1/2} if an analysis similar to Theorem 3.1 were applied to each of the MLP layers.