google / neural-tangents

Fast and Easy Infinite Neural Networks in Python
https://iclr.cc/virtual_2020/poster_SklD9yrFPS.html
Apache License 2.0

"fx_train_0" and "fx_test_0" in "gradient_descent_mse" #78

Closed lionelmessi6410 closed 4 years ago

lionelmessi6410 commented 4 years ago

Thanks for your kind reply. It seems that the only difference between them is the covariance matrix. Does this mean that if I set compute_cov=False in the predict_fn returned by gradient_descent_mse_ensemble, gradient_descent_mse_ensemble will be the same as gradient_descent_mse? If so, which API has better performance, i.e. shorter computation time?
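For context, here is a minimal sketch of what I mean by passing compute_cov=False to the ensemble API (assuming kernel_fn, x_train, y_train, x_test are defined as in the usual setup; only the mean prediction is requested):

import neural_tangents as nt

# Ensemble API: takes the analytic kernel_fn and raw data directly.
ensemble_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)

# With compute_cov=False only the mean NTK prediction on x_test is returned.
ntk_mean = ensemble_fn(t=1.0, x_test=x_test, get='ntk', compute_cov=False)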

I also noticed the fx_train_0 and fx_test_0 arguments in gradient_descent_mse, representing the output of the network at t = 0 on the training and test data, respectively. Based on my understanding, for a linearized neural network to closely approximate the original network, both the tangent kernel and the outputs at initialization, fx_train_0 and fx_test_0, are required.
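As a quick illustration of that point (a sketch only, assuming apply_fn, params, and x_train from the usual init_fn/apply_fn/kernel_fn setup): nt.linearize builds the first-order Taylor expansion of the network around params, and its output at the initial parameters is exactly the fx_train_0 term.

import jax.numpy as jnp
import neural_tangents as nt

# First-order Taylor expansion of apply_fn around the initial parameters.
apply_fn_lin = nt.linearize(apply_fn, params)

# Output of the finite network at initialization, i.e. the fx_train_0 term.
fx_train_0 = apply_fn(params, x_train)

# At the initial parameters the linearized and original networks coincide.
assert jnp.allclose(apply_fn_lin(params, x_train), fx_train_0)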

However, in the infinite-width limit the tangent kernel converges to a deterministic kernel, so there seems to be no need to provide the values at initialization. I notice that in the infinite-width limit you set the outputs at initialization on both the training and test data to 0., as follows:

import neural_tangents as nt

t = 1.0
# Analytic NTK matrices between train/train and test/train inputs.
k_train_train = kernel_fn(x_train, None, 'ntk')
k_test_train = kernel_fn(x_test, x_train, 'ntk')

# Closed-form gradient-descent-on-MSE predictor built from the train kernel.
predict_fn = nt.predict.gradient_descent_mse(k_train_train, y_train)
fx_train_t, fx_test_t = predict_fn(t=t, fx_train_0=0., fx_test_0=0., k_test_train=k_test_train)

These settings also appear in another work, Disentangling Trainability and Generalization in Deep Neural Networks. However, if I instead set them to the values at initialization,

# Outputs of the finite-width network at initialization.
y_train_0 = apply_fn(params, x_train)
y_test_0 = apply_fn(params, x_test)
fx_train_t_0, fx_test_t_0 = predict_fn(t=t, fx_train_0=y_train_0, fx_test_0=y_test_0, k_test_train=k_test_train)

and

fx_train_t != fx_train_t_0
fx_test_t != fx_test_t_0

Why is this the case? Moreover, the outputs (mean) of gradient_descent_mse_ensemble are very close to those of gradient_descent_mse with fx_train_0=0. and fx_test_0=0.. Does this imply that in the infinite-width limit it is better to set those values to 0., while for a finite-width network we should provide the values at initialization?
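For concreteness, the comparison I am describing looks roughly like this (a sketch reusing kernel_fn, x_train, y_train, x_test, t, predict_fn, and k_test_train from the snippets above):

import jax.numpy as jnp
import neural_tangents as nt

# Mean prediction of the infinite-width ensemble on the test set.
ensemble_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
mean_test_t = ensemble_fn(t=t, x_test=x_test, get='ntk', compute_cov=False)

# Non-ensemble prediction with zero initial outputs.
_, fx_test_t = predict_fn(t=t, fx_train_0=0., fx_test_0=0., k_test_train=k_test_train)

# The two agree closely in my experiments, which is what prompted the question.
print(jnp.max(jnp.abs(mean_test_t - fx_test_t)))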

_Originally posted by @lionelmessi6410 in https://github.com/google/neural-tangents/issue_comments/706645520_

romanngg commented 4 years ago

Sorry for the long delay, just commented in the original thread https://github.com/google/neural-tangents/issues/72 - let's continue there!