tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

Flipout Monte Carlo estimator #213

Open BelhalK opened 5 years ago

BelhalK commented 5 years ago

Hi all,

In the bayesian_neural_network.py example, how many samples are used by default to calculate the Flipout Monte Carlo estimator? If I refer to https://www.tensorflow.org/probability/api_docs/python/tfp/layers/Convolution2DFlipout:

It uses the Flipout gradient estimator to minimize the Kullback-Leibler divergence up to a constant, also known as the negative Evidence Lower Bound. It consists of the sum of two terms: the expected negative log-likelihood, which we approximate via Monte Carlo; and the KL divergence, which is added via regularizer terms which are arguments to the layer.
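(In symbols, the objective described there is the negative ELBO; with $q$ the variational posterior over the weights $w$, $p(w)$ the prior and $\mathcal{D}$ the data:)

$$-\mathrm{ELBO}(q) \;=\; \underbrace{\mathbb{E}_{w \sim q}\bigl[-\log p(\mathcal{D} \mid w)\bigr]}_{\text{expected negative log-likelihood, via Monte Carlo}} \;+\; \underbrace{\mathrm{KL}\bigl(q(w)\,\|\,p(w)\bigr)}_{\text{added via the layer's regularizers}}$$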

Monte Carlo approximation is used to approximate the expected negative log-likelihood, but I wonder how many samples are used. Is there a way to increase that number of samples to better approximate my ELBO at each iteration?

Thanks

BelhalK commented 5 years ago

From the init:

__init__(
    filters,
    kernel_size,
    strides=(1, 1),
    padding='valid',
    data_format='channels_last',
    dilation_rate=(1, 1),
    activation=None,
    activity_regularizer=None,
    kernel_posterior_fn=tfp_layers_util.default_mean_field_normal_fn(),
    kernel_posterior_tensor_fn=(lambda d: d.sample()),
    kernel_prior_fn=tfp.layers.default_multivariate_normal_fn,
    kernel_divergence_fn=(lambda q, p, ignore: tfd.kl_divergence(q, p)),
    bias_posterior_fn=tfp_layers_util.default_mean_field_normal_fn(is_singular=True),
    bias_posterior_tensor_fn=(lambda d: d.sample()),
    bias_prior_fn=None,
    bias_divergence_fn=(lambda q, p, ignore: tfd.kl_divergence(q, p)),
    seed=None,
    **kwargs
)

I suppose it is not possible to pass the Monte Carlo scheme to the layer. I am still wondering how many MC samples are used for this approximation. Any idea or code to look at?

Thanks :)

brianwa84 commented 5 years ago

I think you're probably after the details of the flipout layers, and not simply the num_monte_carlo flag (defaults to 50) used to evaluate holdout logprob.

The flipout layers (afaik) should only draw a single Indep(Normal) sample shared across all batch elements and then sample an Indep(Rademacher) per batch element to elementwise-multiply with the all-batch normal. At least, that's how I remember flipout. https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/layers/conv_variational.py#L1075

There is only one output of a layer, so it's not clear to me how you would want to change this. Maybe you'd want to wrap the outputs += perturbed_inputs in a loop [sampling many perturbed_inputs]? Since the expectation of perturbed_inputs should be zero (right?), wouldn't this be basically equivalent to reducing the scale of the normal prior?
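(For reference, a rough sketch of that single-sample flipout perturbation for a dense layer; this is illustrative only, not TFP's implementation, and the names flipout_dense, w_loc, w_scale are made up:)

import tensorflow as tf

def flipout_dense(x, w_loc, w_scale):
    # x: [batch, d_in]; w_loc, w_scale: [d_in, d_out] mean-field posterior parameters.
    # One Gaussian weight perturbation, shared by the whole batch...
    w_perturb = w_scale * tf.random_normal(tf.shape(w_loc))
    outputs = tf.matmul(x, w_loc)                                   # mean weights
    # ...pseudo-decorrelated across examples by per-example Rademacher sign flips.
    sign_in = tf.sign(tf.random_uniform(tf.shape(x)) - 0.5)         # [batch, d_in]
    sign_out = tf.sign(tf.random_uniform(tf.shape(outputs)) - 0.5)  # [batch, d_out]
    outputs += tf.matmul(x * sign_in, w_perturb) * sign_out         # flipout perturbation
    return outputs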

BelhalK commented 5 years ago

Only one Monte Carlo sample by default? Naturally, the Flipout MC estimator becomes more accurate as the number of MC samples increases. For context, I am testing my stochastic optimization method on a Bayesian neural net and would like to make sure that, at each iteration, my ELBO is well approximated. Hence my need to increase the number of MC samples.

I was originally using Edward, before its integration into TFP, where the loss was computed using the reparametrization trick (Blundell et al.). Do you know how to use this loss again with TFP? I guess there is a layer for this that is not ConvFlipout.

brianwa84 commented 5 years ago

The flipout loss is still computed using the reparameterization trick (aka the pathwise derivative, etc.). The nice thing with flipout is that you get a lower-variance estimate of the gradient that is still unbiased. Are you finding that your model is strictly worse with TFP trainable layers than it was with Edward?
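(For reference, the reparameterization trick in one line, with hypothetical variables w_mu and w_std for the variational mean and scale:)

# Write the weight sample as a deterministic transform of parameter-free noise,
# so gradients flow through w_mu and w_std.
eps = tf.random_normal(tf.shape(w_mu))
w_sample = w_mu + w_std * eps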

BelhalK commented 5 years ago

I actually observe the variance reduction, no problem with that. I was just saying that in Edward (just like in the Bayes by Backprop paper), the ELBO was approximated using a Monte Carlo batch of many samples, not just one as in Flipout. (I did notice that, in the paper, a more general expression of the Flipout MC estimator involves a batch size M that would improve the Monte Carlo approximation.) I just need to make sure my ELBO is well approximated at each training iteration. I guess I will try your suggested technique:

Maybe you'd want to wrap the outputs += perturbed_inputs in a loop [sampling many perturbed_inputs]?

brianwa84 commented 5 years ago

So you're referring to a mini-batch, i.e. a batch of data. That is handled in the higher-level main file, and I'm sure it uses batches. The size is 128: https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/bayesian_neural_network.py#L65

BelhalK commented 5 years ago

Sorry if I was not clear enough. I was talking about the Monte Carlo batch (the samples drawn from the variational distribution to compute the Flipout estimator), not the batch of data. There is apparently no obvious way to increase that MC batch size.

brianwa84 commented 5 years ago

I would think increasing the data batch has the effect you want. A larger data batch will induce correspondingly more Rademacher draws to be multiplied with the fixed scaling drawn once per iteration. I don't remember flipout drawing multiple scales per iteration, but it's been a long time since I read the paper. Are multiple scales indeed what you are after? If so, I think you might need something different from my proposal: perhaps a scale for the first half of the minibatch and a scale for the second half (etc.). You could use tf.tile of a small number of scales (e.g. 2 for half and half) to achieve this, as sketched below.
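(A hypothetical sketch of that half-and-half idea for a dense layer; none of this is TFP API, and x, w_loc, w_scale, batch_size, d_in, d_out are assumed to be defined, with batch_size divisible by num_scales:)

num_scales = 2
# Draw a few weight perturbations instead of a single one shared by the whole batch.
perturb = w_scale * tf.random_normal([num_scales, d_in, d_out])
# Repeat each perturbation over a contiguous slice of the minibatch.
per_example = tf.reshape(
    tf.tile(perturb[:, tf.newaxis], [1, batch_size // num_scales, 1, 1]),
    [batch_size, d_in, d_out])
# Per-example weights: posterior mean plus that example's perturbation slice.
outputs = tf.einsum('ni,nio->no', x, w_loc[tf.newaxis] + per_example)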

brianwa84 commented 5 years ago

I just went back to the paper. The only place I see M is in the evolutionary strategies section, and there it's rather about cross-GPU parallelism: a different variational sample per machine. ES is not the mode TFP is trying to support with this library, though I guess we're open to it. If you want M > 1 for this code, you will be inviting M times as many matrix multiplies (or conv2ds, etc.), which is ok. But let's keep the default at M = 1, which only has 2x overhead.

BelhalK commented 5 years ago

Good morning,

I actually have the same question for layers of type Convolution2DReparameterization. The loss function for such layers, at least the intractable part, is approximated by Monte Carlo using the reparametrization trick of Kingma and Welling. I can't find how many Monte Carlo samples are drawn to compute this approximation at each iteration in https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/layers/conv_variational.py#L1075.

Could you please help me on this? Belhal

csuter commented 5 years ago

@BelhalK I don't think there is actually a Monte Carlo contribution to the loss being included. The _apply_divergence function, which accepts a tensor argument, presumably for this purpose, is actually only ever called with that arg omitted (see the lambda here: https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/layers/conv_variational.py#L114).

Instead, the loss is always computed as an analytic KL between the posterior and prior, which you can discover by tracing backwards from _apply_divergence (https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/layers/conv_variational.py#L401).

It seems the code was designed with the potential to override these, but it's not immediately clear to me what the right way to do it is. E.g., if you override kernel_posterior_tensor_fn to return multiple samples, it seems to me that will cause problems in _apply_variational_kernel, where the tensor is used to compute the forward outputs.

Sorry we don't have a better path forward right now. We are currently considering some ideas to revamp the tfp.layers library, which should bear fruit in the near future.
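(As an illustration of that override hook, not an official TFP recipe: the kernel_divergence_fn argument already takes a callable of (posterior, prior, posterior_tensor), so a Monte Carlo KL estimate could in principle be swapped in; mc_kl and num_samples are made-up names here:)

def mc_kl(q, p, q_tensor, num_samples=16):
    # Monte Carlo estimate of KL(q || p) from posterior samples; the q_tensor
    # argument is ignored, matching the default lambda's signature.
    samples = q.sample(num_samples)
    return tf.reduce_mean(q.log_prob(samples) - p.log_prob(samples))

layer = tfp.layers.Convolution2DFlipout(
    filters=6, kernel_size=5, padding="SAME", activation=tf.nn.relu,
    kernel_divergence_fn=mc_kl)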

SiegeLordEx commented 5 years ago

@csuter The KL divergence is indeed analytic by default, but the log-likelihood term is approximated via random draws.

@BelhalK The answer is actually the same as for the fully connected layer: it's 1 (Gaussian) sample + a per-example sign flip (courtesy of flipout). I think the easiest way to increase the number is to run the loss multiple times:

# Each call to loss_fn draws fresh weight samples; average num_samples of them
# (or use a tf.while_loop instead of the Python list comprehension).
loss = 1.0 / num_samples * tf.add_n([loss_fn(example_batch) for _ in range(num_samples)])

Where loss_fn internally uses these Bayesian layers. As far as I can tell, this is exactly what Edward does in its bayesian_nn example.

csuter commented 5 years ago

Ah I see, but the LL term would be computed in user code, not in the layer itself. Thanks, Pavel!

BelhalK commented 5 years ago

Here is the small section of interest in my code:

with tf.name_scope("bayesian_neural_net", values=[images]):
    neural_net = tf.keras.Sequential([
        tfp.layers.Convolution2DReparameterization(6,
                                        kernel_size=5,
                                        padding="SAME",
                                        activation=tf.nn.relu),
        tf.keras.layers.MaxPooling2D(pool_size=[2, 2],
                                     strides=[2, 2],
                                     padding="SAME"),
        tf.keras.layers.Flatten(),
        tfp.layers.DenseReparameterization(84, activation=tf.nn.relu),
        tfp.layers.DenseReparameterization(10)
        ])

    logits = neural_net(images)
    labels_distribution = tfd.Categorical(logits=logits)

#Loss
neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels))
kl = sum(neural_net.losses) / mnist_data.train.num_examples
elbo_loss = neg_log_likelihood + kl

#Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=FLAGS.learning_rate)
train_op = optimizer.minimize(elbo_loss)

Thanks @SiegeLordEx for your suggestions. I understand the principle of computing the loss many times. How would your solution fit here? Would it be at the following line?

neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels))

SiegeLordEx commented 5 years ago

Here's one way to do this (be careful about numerical stability when you actually implement it; you probably want a tf.reduce_mean instead of the manual running sum I did):

with tf.name_scope("bayesian_neural_net", values=[images]):
    neural_net = ...

elbo_loss = 0.

for i in range(num_samples):
    logits = neural_net(images)
    labels_distribution = tfd.Categorical(logits=logits)

    #Loss
    neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels))
    kl = sum(neural_net.losses) / mnist_data.train.num_examples
    elbo_loss += neg_log_likelihood + kl

elbo_loss /= num_samples

#Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=FLAGS.learning_rate)
train_op = optimizer.minimize(elbo_loss)

The key point here is that you must call neural_net(images) multiple times, as that is what gives you a new random weight sample per call. Also, in your case, if you're leaving everything at its defaults, you can move the KL computation outside the loop, as it is deterministic and independent of images.

EDIT: Sorry, I had an incorrect suggestion about using a multi-dimensional batch here, it won't work.
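(A sketch of the tf.reduce_mean variant mentioned above, with the KL term moved outside the loop as suggested; it assumes the same neural_net, images, labels, mnist_data and num_samples as in the snippets above:)

nll_samples = []
for _ in range(num_samples):
    logits = neural_net(images)  # each call draws fresh weight samples
    labels_distribution = tfd.Categorical(logits=logits)
    nll_samples.append(-tf.reduce_mean(labels_distribution.log_prob(labels)))

neg_log_likelihood = tf.reduce_mean(tf.stack(nll_samples))
# With the default layers the KL term is analytic and data-independent, so compute it once.
kl = sum(neural_net.losses) / mnist_data.train.num_examples
elbo_loss = neg_log_likelihood + kl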

SiegeLordEx commented 5 years ago

In terms of doing this 'properly', one possible way it could look is this:

  1. Add an extra leading dimension to the inputs that equals the number of samples you want to average over.
  2. Alter each layer's kernel_posterior_tensor_fn to return that many samples.

This doesn't quite work right now. As Chris said, however, we're looking into a new abstraction for these layers and we'll definitely consider this case.

BelhalK commented 5 years ago

Thanks a lot for this. I will implement it as you suggested. As far as improvements go, why not just add a "num_samples" argument to the function where this loss is currently computed, so that the user could define how many Monte Carlo samples to simulate? Also, regarding your suggestion to "Add an extra leading dimension to the inputs that equals the number of samples you want to average over": I don't see how this relates to the input layer; we are not talking about averaging over input samples but rather about Monte Carlo samples (simulations obtained by sampling from the variational candidate to compute an approximation of the loss function).

SiegeLordEx commented 5 years ago

The thing is, 'the function where this loss function is computed right now' is partly in the user's code. I.e., the term that computes the MC average is this one: neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels)). The only thing the layers are responsible for (ignoring the KL term) is generating the reparametrizable weight perturbations. What I'm proposing is letting the layers generate multiple weight perturbations, but right now they're not set up to do that (as Chris pointed out).

As for the discussion about inputs, that's just an implementation detail, perhaps one that we can avoid. Note that it is essential to handle the inputs in some way, because when you say 'multiple samples' you implicitly mean running the network multiple times over each input, once for each sample (the for loop in my code snippet makes this explicit). If you sample the inputs independently from the weights, you won't gain as much variance reduction as you could (flipout is already doing something like this).

xht033 commented 5 years ago

153

DimitrisCC commented 3 years ago

I have a similar issue. For Dense layers, I just augment the input like this: [num_particles] + input.shape, and for the weights, using the reparam trick: W = w_mu + w_std*tf.random_normal([num_particles] + [dim_in, dim_out]), where w_mu and w_std are trainable variables of shape [dim_in, dim_out]. Then a tf.matmul(input_aug, W) does the trick (the output keeps the num_particles dim). So basically, it's the same weight distribution for the whole minibatch, but multiple different samples from it, as in the sketch below.
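(A minimal sketch of that dense-layer construction, with hypothetical shapes and variable names:)

import tensorflow as tf

num_particles, dim_in, dim_out, batch = 8, 784, 100, 32
x = tf.placeholder(tf.float32, [batch, dim_in])                      # minibatch input
x_aug = tf.tile(x[tf.newaxis], [num_particles, 1, 1])                # [particles, batch, dim_in]
w_mu = tf.get_variable("w_mu", [dim_in, dim_out])
w_std = tf.nn.softplus(tf.get_variable("w_rho", [dim_in, dim_out]))  # keep the scale positive
W = w_mu + w_std * tf.random_normal([num_particles, dim_in, dim_out])  # one weight draw per particle
out = tf.matmul(x_aug, W)                                            # [particles, batch, dim_out]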

I want the same thing, but for 2D convolutions. It seems that "faking" the extra dim with tf.nn.conv3d isn't working as intended.

Anyone got any tested ideas?