google / neural-tangents

Fast and Easy Infinite Neural Networks in Python
https://iclr.cc/virtual_2020/poster_SklD9yrFPS.html
Apache License 2.0
2.28k stars · 226 forks

Questions about wide networks #131

Open JinraeKim opened 2 years ago

JinraeKim commented 2 years ago

Hi, team! I'm interested in this library, but I'm finding it hard to understand. I have some questions about wide networks.

  1. What is the prediction of a wide network, e.g., an NNGP? The evaluated mean of the GP? Is it deterministic? If so, how does it differ from a standard GP?
  2. How is a wide network trained? For example, an NNGP seems to be trained very similarly to a standard GP, i.e., with matrix inversion. Is that right?
  3. Are the training and inference mechanisms the same for both finite and infinite-width networks in this package?

SiuMath commented 2 years ago

Hi Jinrae, glad that you are interested in NT; we are more than happy to help. There are a couple of tutorials on GitHub that could be very useful: https://github.com/google/neural-tangents/tree/main/notebooks.

  1. For the NNGP, it is basically kernel regression / Bayesian inference. We need to pass either an infinite-width NNGP kernel, which is deterministic, or a finite-width empirical NNGP kernel, which is stochastic (like a random-feature model); see the first sketch after this list.

  2. Yes. The NNGP is trained, or more precisely "does inference", using matrix inversion; see the second sketch after this list.

  3. There are a couple of "training approaches":

    • finite-width SGD training, which is the same as for standard neural networks;
    • NTK/NNGP-related "inference", in which we use Bayesian inference / matrix inversion. Finite- and infinite-width networks are handled almost identically; the major difference is how the kernel is computed.
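
To make item 1 concrete, here is a rough sketch with arbitrary toy data and an arbitrary width of 512 (see the notebooks linked above for the exact API); it constructs both kinds of NNGP kernel:

```python
import jax.random as random
import neural_tangents as nt
from neural_tangents import stax

key = random.PRNGKey(0)
x_train = random.normal(key, (20, 10))  # toy data: 20 points, 10 features
x_test = random.normal(key, (5, 10))

# The width (512) only affects the finite-width functions below;
# kernel_fn describes the corresponding infinite-width network.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(), stax.Dense(1))

# (a) Infinite-width NNGP kernel: deterministic, no parameters involved.
k_dd = kernel_fn(x_train, x_train, 'nngp')

# (b) Finite-width empirical NNGP kernel: computed from randomly drawn
#     parameters, hence stochastic (like a random-feature model).
_, params = init_fn(key, x_train.shape)
nngp_fn = nt.empirical_nngp_fn(apply_fn)
k_dd_empirical = nngp_fn(x_train, x_train, params)
```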
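
For item 2, the "matrix inversion" is just the usual GP posterior mean. A hand-rolled sketch, reusing the toy data and kernel_fn above, with a small arbitrary ridge term for numerical stability (nt.predict has utilities that do this for you):

```python
import jax.numpy as jnp

y_train = random.normal(key, (20, 1))        # hypothetical labels

k_dd = kernel_fn(x_train, x_train, 'nngp')   # train-train kernel
k_td = kernel_fn(x_test, x_train, 'nngp')    # test-train kernel
diag_reg = 1e-4                              # small ridge / noise term

# Posterior mean at the test points: K_td (K_dd + reg * I)^{-1} y.
# No gradient descent -- just a linear solve ("matrix inversion").
mean = k_td @ jnp.linalg.solve(k_dd + diag_reg * jnp.eye(k_dd.shape[0]), y_train)
```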

This paper may help clarify some concepts related to finite- and infinite-width networks: https://arxiv.org/abs/1902.06720. Roughly, as the width approaches infinity, the (SGD) training dynamics of a finite-width network converge to something very similar to kernel regression / Bayesian inference.
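
In code, that convergence is what the closed-form prediction utilities expose; a sketch reusing the toy data, kernel_fn, and y_train from the snippets above (again with an arbitrary ridge term):

```python
from neural_tangents import predict

# Closed-form predictions for an ensemble of infinitely wide networks:
# 'ntk'  -> the outcome of (infinite-time) gradient-descent training,
# 'nngp' -> exact Bayesian inference with the NNGP kernel.
predict_fn = predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)

y_test_ntk = predict_fn(x_test=x_test, get='ntk')
y_test_nngp = predict_fn(x_test=x_test, get='nngp')
```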

Let us know if you have any other questions.

JinraeKim commented 2 years ago

Thank you so much for your detailed answer. If you don't mind, please answer the following questions.

  • So, let's begin with the infinitely wide NNGP. As I understand it, from the fact that an infinitely wide NN gives us a GP, Bayesian inference is performed by computing the kernel recursively and then inverting a matrix to obtain the mean and covariance at the given test points, right?
  • For a finitely wide NNGP, the kernel recursion seems not to be deterministic. Is that why you pointed out the difference between deterministic and stochastic kernel calculation for infinite- and finite-width NNGPs?
  • How does a finite-width NNGP have the same training procedure? I thought it has stochastic network parameters, while an ordinary NN has deterministic network parameters. I supposed that all we need to infer is the kernel, not the network parameters. Is that right?

Sorry for my poor questions; I lack background in GPs and NTK.

EDIT: I read the tutorial notebooks. For an infinitely wide NNGP, only the network architecture matters for Bayesian inference. In this regard, I don't understand why the tutorials usually construct a finitely wide NN (e.g., width 512) even for the kernel calculation. Also, I'm not sure which is preferred: an ensemble of finite NNs with randomized parameters, or a simple NNGP.

SiuMath commented 2 years ago

Hey Jinrae,

We are indeed very excited and eager to see Neural Tangents being used beyond the GP/NTK community. If you have further questions, please let us know!

Best,
