tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

Is it possible for the prior of a Bayesian layer to be trainable? #882

Closed nbro closed 4 years ago

nbro commented 4 years ago

So far, I've only seen examples of Bayesian neural networks with fixed priors and variable posteriors.

Can the priors of Bayesian layers (such as DenseFlipout) be trainable in some sense? In particular, if we use a Gaussian prior, can its variance be trainable? Is there something like this already available in TFP?

If I remember correctly, David MacKay proposed some model where the prior was trainable, although I don't remember exactly how that was done. However, this can be useful e.g. if you want to find the best variance for the prior.

I know we can pass a custom function to kernel_prior_fn that creates the prior. From what I know of Keras, you can create trainable parameters e.g. by subclassing the Layer class, but I would like to avoid that API and stick to the sequential or functional APIs. Is this possible? If yes, can you provide a simple example of how it would be done?
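For context, a minimal sketch of the kind of callable that kernel_prior_fn expects (this mirrors tfp.layers.default_multivariate_normal_fn, the default fixed standard-normal prior; the function name and layer below are only an illustration, not a solution to the trainable-prior question):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# A fixed (non-trainable) standard-normal prior over a kernel of the given shape.
def fixed_normal_prior_fn(dtype, shape, name, trainable, add_variable_fn):
  del name, trainable, add_variable_fn  # unused for a fixed prior
  return tfd.Independent(
      tfd.Normal(loc=tf.zeros(shape, dtype), scale=tf.ones(shape, dtype)),
      reinterpreted_batch_ndims=len(shape))

layer = tfp.layers.DenseFlipout(units=16, kernel_prior_fn=fixed_normal_prior_fn)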

bgroenks96 commented 4 years ago

Is it possible to do in TensorFlow? Probably. Like you said, just use trainable parameters in your prior distribution. Does it make sense? Not really. A "trainable" prior isn't really a prior at all, from a Bayesian perspective.

nbro commented 4 years ago

@bgroenks96 Are you sure you can have a trainable prior if you use the sequential or functional APIs?

If you can provide a minimal, complete, and executable example, I would really appreciate it, because, even though I've been using TF and TFP for a while, I am still not sure how to do a lot of stuff with them.

bgroenks96 commented 4 years ago

I am not sure, but I am saying that it doesn't make sense mathematically. By training your prior, you are implicitly conditioning it on your data; thus, it is no longer a prior but instead a likelihood.

In non-Bayesian deep learning, this would be equivalent to making your regularization coefficients trainable. It pretty much nullifies the effect of regularization because the model can just update the coefficients arbitrarily to minimize the loss function.
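As a toy illustration of that point (made-up numbers, with the coefficient kept positive via a log parameterization): the gradient of the penalty with respect to the coefficient is non-negative, so gradient descent just drives the coefficient down and the regularizer fades away.

import tensorflow as tf

# Trainable weights and a trainable (log-parameterized) L2 coefficient.
w = tf.Variable([1.0, -2.0])
log_lam = tf.Variable(0.0)  # lambda = exp(log_lam) stays positive

with tf.GradientTape() as tape:
  penalty = tf.exp(log_lam) * tf.reduce_sum(tf.square(w))

# The gradient is exp(log_lam) * ||w||^2 >= 0 (here 5.0), so a gradient step
# on the penalty alone always pushes log_lam down, shrinking the regularizer.
print(tape.gradient(penalty, log_lam).numpy())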

nbro commented 4 years ago

@bgroenks96 But I would be training only the variance (and not also the mean) of the prior. If I remember correctly, Neal said that MacKay used something like that in his earlier works. I would need to check the research papers again.

bgroenks96 commented 4 years ago

Yeah, that's equivalent to training lambda in a ridge regression or LASSO model. There is a reason no one does it. Perhaps the paper you read was describing variational parameters or hyperpriors.

bgroenks96 commented 4 years ago

If you are really committed to doing this, I think this is one way:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def prior_trainable(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  return tf.keras.Sequential([
      # One trainable location per weight; the scale is kept fixed at 1.
      tfp.layers.VariableLayer(n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t, scale=1),
          reinterpreted_batch_ndims=1)),
  ])

as described in this tutorial. Note that what the authors say at the end is not exactly true. By applying likelihood gradient updates to the prior, you are implicitly conditioning it on your data, so it is no longer a true prior. In Bayesian analysis, we generally perform multiple analyses with various fixed priors to see how sensitive the model is to changes in our prior beliefs. That is the correct approach if your goal is to build a Bayesian model.

I suppose you could interpret this as kind of "updating" your prior according to the previous posterior, like we do in Bayes networks. This isn't entirely clear, however, since you're using the gradient rather than the posterior distribution itself.
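For reference (continuing from the snippet above, so tf, tfp, and tfd are assumed to be imported already), a sketch of how prior_trainable is typically wired into a model via tfp.layers.DenseVariational; the posterior_mean_field definition and the kl_weight value here are illustrative, in the style of that tutorial, rather than something stated in this thread:

import numpy as np

# Mean-field Gaussian posterior: trainable loc and (softplus-transformed) scale.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  c = np.log(np.expm1(1.))
  return tf.keras.Sequential([
      tfp.layers.VariableLayer(2 * n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t[..., :n],
                     scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
          reinterpreted_batch_ndims=1)),
  ])

model = tf.keras.Sequential([
    tfp.layers.DenseVariational(
        units=1,
        make_posterior_fn=posterior_mean_field,
        make_prior_fn=prior_trainable,
        kl_weight=1 / 100.),  # assumed: 1 / number of training examples
    tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1.)),
])
model.compile(optimizer='adam',
              loss=lambda y, rv_y: -rv_y.log_prob(y))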

nbro commented 4 years ago

@bgroenks96 Thanks a lot for your example, comment and the information you provided. I had actually read that article several months ago, but I had forgotten about it.

Do you know of a good and readable research paper that talks about these hyper-priors and their effect on learning?

It's true that we are using optimization to approximate posteriors. The optimization objective is typically the ELBO, which is a lower bound on the (log) evidence of your data. So, if you maximize the ELBO, you push up a lower bound on the evidence and, at the same time, you minimize the KL divergence between your approximate posterior and your prior. As you say, if the prior is learnable, I am not sure what that implies for the optimization. It could happen that the posterior doesn't change and it's the prior that moves instead; in that case, the regularisation effect would probably be attenuated.
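For concreteness, the objective in question, in standard variational-inference notation (q_phi is the variational posterior over the weights w, p(w) the prior, and D the training data):

$$
\mathrm{ELBO}(\phi) \;=\; \mathbb{E}_{q_\phi(w)}\!\left[\log p(\mathcal{D} \mid w)\right] \;-\; \mathrm{KL}\!\left(q_\phi(w) \,\|\, p(w)\right) \;\le\; \log p(\mathcal{D})
$$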

bgroenks96 commented 4 years ago

Hyperpriors are a concept from the hierarchical Bayesian modeling literature. I am not aware of any papers on this in deep learning (although I'm sure there must be at least a few). Hypernetworks (NNs predicting the parameters of a NN) would be an example of a deep-learning style hyperprior.

I think that, if you update your prior in VI, the model will just modify the prior variance in order to minimize the KL divergence with the variational posterior. My guess is that this would lead to very small prior variance and thus a sort of "over-confidence" in your predictions.
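A toy sketch of that effect (the posterior parameters and the trainable prior scale below are made-up illustrative values): the prior's scale receives a gradient from the KL term, so plain gradient descent is free to move it toward whatever makes the KL small.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# A fixed, made-up mean-field "posterior" over two weights and a zero-mean
# Gaussian prior whose scale is a trainable variable.
posterior = tfd.Normal(loc=[0.3, -0.7], scale=[0.2, 0.1])
prior_scale = tf.Variable(1.0)

with tf.GradientTape() as tape:
  prior = tfd.Normal(loc=0., scale=prior_scale)
  kl = tf.reduce_sum(tfd.kl_divergence(posterior, prior))

# The gradient is positive here, so a gradient step on the KL term alone
# would shrink the prior scale toward the spread of the posterior.
print(kl.numpy(), tape.gradient(kl, prior_scale).numpy())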

nbro commented 4 years ago

@bgroenks96 Now that you mention hierarchical models, I am sure it was Neal who, in his PhD thesis, discussed MacKay's original work on the topic. Basically, MacKay was the person who introduced Bayesian learning to neural networks (i.e. Bayesian neural networks).

bgroenks96 commented 4 years ago

If you find the paper where he talks about "learning" the prior parameters, post a link. I would be curious to see how he squares this with Bayesian theory, because it's not at all clear to me how we should interpret the posterior in this case.

nbro commented 4 years ago

Actually, it's quite easy to have a trainable prior. Just initialize it in a similar way to how you initialize the posterior.
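For example, a possible (untested) sketch of that idea for a Flipout layer, assuming tfp.layers.default_mean_field_normal_fn (the builder used for the posterior by default) can also be reused for the prior:

import tensorflow_probability as tfp

# Both posterior and prior are built as mean-field normals with trainable
# loc/scale variables; only the kernel_prior_fn argument differs from the default.
layer = tfp.layers.DenseFlipout(
    units=16,
    kernel_posterior_fn=tfp.layers.default_mean_field_normal_fn(),
    kernel_prior_fn=tfp.layers.default_mean_field_normal_fn())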

saurabhdeshpande93 commented 3 years ago

@nbro, but doing that will simply make your KL term insignificant, right? For instance, in the TensorFlow blog post they have only kept the mean as trainable. Also, could you figure out a way to use trainable priors for Flipout layers?