1) Regarding the addition - it happens in this line: https://github.com/google/neural-tangents/blob/fd1611660c87edcb0c2e50403f691b60d2cc252b/neural_tangents/stax.py#L2652. This ensures that `prod` in https://github.com/google/neural-tangents/blob/fd1611660c87edcb0c2e50403f691b60d2cc252b/neural_tangents/stax.py#L2667 is the first term under the square root in your formula, i.e. (1 + 2\Sigma_{xx})(1 + 2\Sigma_{\hat{x}\hat{x}}).
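For reference, here is a sketch of the standard closed-form Erf expectations this code implements (as in Lee et al. 2019; my transcription for context, not copied from stax.py):

\mathcal{T}(\Sigma) = \frac{2}{\pi} \arcsin\!\left(\frac{2\Sigma_{x\hat{x}}}{\sqrt{(1 + 2\Sigma_{xx})(1 + 2\Sigma_{\hat{x}\hat{x}})}}\right), \qquad \dot{\mathcal{T}}(\Sigma) = \frac{4}{\pi}\left[(1 + 2\Sigma_{xx})(1 + 2\Sigma_{\hat{x}\hat{x}}) - 4\Sigma_{x\hat{x}}^2\right]^{-1/2}

Adding 1 to the diagonal covariance entries is exactly what produces the (1 + 2\Sigma) factors above.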
2) Re divergence - please note that stax.Dense(2, 100) is the same as stax.Dense(width=2, W_std=100), so your weight variance is very high, and I believe in this case it makes sense for the NTK to become large with weight variance when x \approx \hat{x}. I.e. in this case NTK = \Sigma \cdot \dot{\mathcal{T}}, where \Sigma = W_std**2 * x @ x.T is quadratic in W_std, but \dot{\mathcal{T}} \sim 1 / W_std (i.e. only inverse-linear in W_std when x \approx \hat{x}), so their product should be proportional to W_std. See the numeric sketch below.
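A quick numeric check of this scaling argument (my own sketch, not code from the repo; it just plugs the Erf derivative kernel from Lee et al. 2019 into the reasoning above):

```python
# Sketch: Sigma grows quadratically in W_std, the Erf derivative kernel
# shrinks roughly like 1 / W_std near x = x_hat, so their product
# (the 2-layer NTK without training the last layer) grows ~ linearly in W_std.
import numpy as np

def two_layer_erf_ntk(w_std):
    sigma = w_std**2  # Sigma = W_std**2 * x @ x.T for unit-norm x = x_hat
    # Erf derivative kernel: (4 / pi) * det(I + 2 * Sigma)**(-1/2)
    t_dot = (4 / np.pi) / np.sqrt((1 + 2 * sigma) ** 2 - 4 * sigma**2)
    return sigma * t_dot  # NTK = Sigma * T_dot

for w in [1.0, 10.0, 100.0]:
    print(f"W_std = {w:6.1f}  ->  NTK ~ {two_layer_erf_ntk(w):8.3f}")
```

For large W_std this tends to 2 * W_std / \pi, i.e. linear in W_std, as claimed.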
Lmk if this helps!
Thank you for your reply. I understand 👍
Hi, I was just wondering: is the erf still nonlinear when calculating the NTK? How do we convert a nonlinear activation function into something linear? Each time you compute NTK(x, X) you pass through a nonlinearity, so how is the final closed form linear?
Hi,
Regarding Erf() in stax, I want to confirm the implementation. When we consider a 2-layer MLP without training the last layer, the NTK is the covariance matrix of the data multiplied by a derivative-kernel factor (ref: Lee et al. 2019, Jiang et al. 2020). Here, the unit matrix is added to the covariance matrix, but I cannot find that part in the current stax implementation.
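Specifically, I believe the factor in question is the Erf derivative kernel from Lee et al. 2019 (writing it out here, since this is where the unit matrix appears):

\dot{\mathcal{T}}(\Sigma) = \frac{4}{\pi} \det(I + 2\Sigma)^{-1/2} = \frac{4}{\pi}\left[(1 + 2\Sigma_{xx})(1 + 2\Sigma_{\hat{x}\hat{x}}) - 4\Sigma_{x\hat{x}}^2\right]^{-1/2}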
Using neural-tangents, I tried to visualize the NTK with the default Erf (a=1.0, b=1.0, c=0.0), but the values seem to diverge when the inner product of the inputs is 0 or 1. The input vectors are normalized to unit length. Am I missing something?
Code
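(The original code block is not included here; below is a minimal hypothetical sketch of such a repro with the neural-tangents API, using the W_std = 100 setting discussed in the reply - not the author's actual code.)

```python
# Hypothetical repro sketch: sweep the inner product of two unit-norm inputs
# and evaluate the infinite-width NTK of Dense -> Erf -> Dense.
import jax.numpy as jnp
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(2, 100.0),  # out_dim=2, W_std=100 (see the reply above)
    stax.Erf(),            # defaults: a=1.0, b=1.0, c=0.0
    stax.Dense(1, 100.0),
)

thetas = jnp.linspace(0.0, jnp.pi, 11)
x1 = jnp.stack([jnp.cos(thetas), jnp.sin(thetas)], axis=1)  # unit-norm inputs
x2 = jnp.array([[1.0, 0.0]])                                # fixed unit-norm input

ntk = kernel_fn(x1, x2, 'ntk')  # NTK as a function of <x, x_hat> = cos(theta)
print(ntk)
```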