pawni / BayesByHypernet

Code for the paper Implicit Weight Uncertainty in Neural Networks

Incorrectly specified hypernetwork #1

Closed gdikov closed 5 years ago

gdikov commented 6 years ago

Hi @pawni,

In the IPython notebook of the toy example, you forgot to connect the layers of the hypernetwork to each other. The function I am talking about is:

import tensorflow as tf

def get_h_net(num_noise=1):
    with tf.variable_scope('h_net'):

        w1_c = tf.constant([1., 0.])
        w2_c = tf.constant([0., 1.])

        noise = tf.random_normal((num_noise, ))

        w1_z = tf.reshape(tf.concat([w1_c, noise], 0), (1, num_noise + 2))
        w2_z = tf.reshape(tf.concat([w2_c, noise], 0), (1, num_noise + 2))

        w_z = tf.concat([w1_z, w2_z], 0)

        z = tf.layers.dense(inputs=w_z, units=64)
        z = tf.nn.relu(z)

        z = tf.layers.dense(inputs=w_z, units=256)  # <- bug: should be inputs=z
        z = tf.nn.relu(z)

        z = tf.layers.dense(inputs=w_z, units=100)  # <- bug: should be inputs=z

        w1 = z[0, :]
        w2 = z[1, :]

        return [w1, w2, tf.reshape(tf.concat([w1, w2], 0), (200, 1))]

You should change `w_z` to `z` in the lines `z = tf.layers.dense(inputs=w_z, units=256)` and `z = tf.layers.dense(inputs=w_z, units=100)`. I guess it was a copy-paste issue, but it has the unfortunate consequence of specifying the hypernetwork as a linear function of the noise! A similar bug is also present in the MNIST notebook, where the last dense layer is connected to the first instead of the previous one.
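For reference, a minimal sketch of the corrected wiring (only the two inputs change; everything else stays as in the original function):

        z = tf.layers.dense(inputs=w_z, units=64)
        z = tf.nn.relu(z)

        z = tf.layers.dense(inputs=z, units=256)  # now consumes the previous activation
        z = tf.nn.relu(z)

        z = tf.layers.dense(inputs=z, units=100)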

Best, Georgi

pawni commented 6 years ago

Hi Georgi,

thanks for pointing that out! I'm aware of the issue but haven't had the chance to update it online yet. Overall, I found that this does not affect the performance of the approach.

I am working on extending the work into a full paper and planned to update the code as soon as that is done. As a preview, though, I can say that the approach also seems to work reasonably well with one hypernet per layer and a single noise variable. It also scales okayish to ResNets.

I am happy to accept a PR for this, but I might put a note in the README that an update will follow. I am happy to discuss specifics though :)

gdikov commented 6 years ago

Hi Nick,

I would gladly open a PR but, unfortunately, I was not able to reproduce the results in the notebook even after playing with the network architectures, learning rates and training schedules. I will take some more time to find a balanced parametrisation for the discriminator and generator and will then open the PR.

Yesterday, however, I implemented my own version of the Bayes by Hypernetworks (BBH) model, based on your implementation, and I would like to share some interesting findings. I ran it on a toy regression dataset for which my Bayes by Backprop (https://arxiv.org/pdf/1505.05424.pdf) implementation failed to produce nice uncertainty estimates in the interesting regions, whereas BBH did. In order to make it work, though, I was much more modest in the networks' parametrisation than your original scripts.

[Figures: Bayes by Backprop (bbb_regression) vs. Bayes by Hypernet (hypernets_regression) predictive uncertainty on the toy regression dataset]

Actually, in most of the papers on Bayesian neural networks I have looked at, people show predictive uncertainty on toy datasets that are in a way connected, and all of them look good. I was surprised to discover, however, the inability of BBB to produce increased uncertainty estimates in the space between the two groups of samples. Maybe my intuition, which tells me to expect that, is wrong... Any thoughts on that?

Sorry for going off-topic, and best regards, Georgi

bashhwu commented 6 years ago

Hi @gdikov

Many thanks for your comment.

Would you mind sending me the code you used to generate the toy dataset? I will try it with my BBB code and compare the results to yours.

gdikov commented 6 years ago

Hi @bashhwu,

Sure, here it is:

import numpy as np

rng = np.random.RandomState()

def generate_toy(data_size=1000, noise_std=.02):
    # Noisy mix of a linear trend and two sinusoids, adapted from
    # Blundell et al.'s BBB paper.
    def f(x):
        return (0.3*x + rng.normal(0, noise_std, size=x.shape)
                + 0.3*np.sin(2*np.pi*x + rng.normal(0, noise_std, size=x.shape))
                + 0.3*np.sin(4*np.pi*x + rng.normal(0, noise_std, size=x.shape)))

    # Two disjoint input clusters leave a gap in (-0.25, 0.0), where the
    # predictive uncertainty should grow.
    data_x = np.concatenate([np.linspace(-0.5, -0.25, data_size // 2),
                             np.linspace(0.0, 0.25, data_size // 2)]).reshape(-1, 1)
    data_y = f(data_x)
    return data_x, data_y

Btw, f is overly complicated, as it was originally taken from Blundell's BBB paper, but it doesn't really matter -- it serves its purpose.
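In case it helps, a quick way to see the gap (assuming matplotlib; illustrative only):

import matplotlib.pyplot as plt

x, y = generate_toy()
plt.scatter(x, y, s=2)  # note the gap between x = -0.25 and x = 0.0
plt.show()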

bashhwu commented 6 years ago

Thanks for the code, @gdikov. I will display the result here once finished.

The figure below shows the predictive uncertainty when the toy dataset (y = x**3) is applied to the BBB model with 50 training data points. Unfortunately, sns.tsplot is unable to display the uncertainty intervals clearly due to the low variance of the samples drawn from the variational distribution. Therefore, I preferred to plot them separately.

[Figure: BBB predictive uncertainty on y = x**3 with 50 training points, samples plotted separately]

Do you think I should blame the model?
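For reference, a sketch that draws the uncertainty bands explicitly with matplotlib instead of sns.tsplot (pred_samples and x_test are hypothetical names for the stacked posterior-sample predictions and the test inputs):

import matplotlib.pyplot as plt

# pred_samples: (num_samples, num_points) array of network outputs, one row
# per draw from the variational posterior; x_test: (num_points, 1) inputs.
mean = pred_samples.mean(axis=0)
std = pred_samples.std(axis=0)

plt.plot(x_test.ravel(), mean, label='predictive mean')
plt.fill_between(x_test.ravel(), mean - 2*std, mean + 2*std,
                 alpha=0.3, label='+/- 2 std')
plt.legend()
plt.show()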

bashhwu commented 6 years ago

Hi @gdikov,

Could you please provide me with the structure of your BBB network so I can make a fair comparison?

gdikov commented 6 years ago

Well, I don't know your prior, but I think that for 50 data points (and assuming a large enough number of parameters) you should see quite a significant pull from the KL term, and hence your posterior variance shouldn't shrink as much. Maybe something is wrong with the loss formulation?

My BBB network wasn't anything special. Basically anything simple will do the trick for this dataset, e.g. one or two layers of 50 ReLU units will be more than enough. Notice that I trained it on 1000 points, and the variance is quite small. With only 50 points I would get a much more entropic predictive distribution.
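As a sanity check for the loss formulation, a sketch under the usual minibatch convention (not this thread's actual code): with N training points and minibatches of size B, the KL term is weighted so that the whole dataset sees it exactly once per epoch.

# Hedged sketch of the per-minibatch negative ELBO for BBB.
# kl_q_p:  KL(q(w) || p(w)) summed over all weights
# log_lik: sum of log-likelihoods over the minibatch of size B
# N:       total number of training points
loss = kl_q_p / N - log_lik / B  # minimise this

If the KL is weighted too heavily, the posterior variance over-inflates; too lightly, and it collapses, which would match the overly tight intervals above.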

bashhwu commented 6 years ago

Hi @gdikov, the prior I used is N(0, 1).

After initializing the model parameters with different values, I get this figure:

[Figure: BBB predictive uncertainty after re-initialization]

Regarding your comment about the KL term, I used both the analytical term and the one used by the authors of the original paper. The analytical one does not help.

For the toy dataset you gave above, it seems that my model does not learn the data. The same model works for classification tasks on the MNIST dataset.

pawni commented 6 years ago

Great to see your discussion here. I just updated all the code for a new arXiv version that should come online soon.

As for BbB, the initialisation is quite important. @gdikov, I would love to see your BbH code and how you parametrise the network, and especially how you frame the KL estimation, as I've found the KL to be quite unstable and the new kernel method to work much better.
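For context, a generic way to estimate a KL term when q is only available through samples is a kernel density estimate; the sketch below is just an illustration of the idea for a 1-D weight, not necessarily the kernel method from the paper:

import numpy as np
from scipy.stats import gaussian_kde, norm

def kde_kl_estimate(w_samples, prior_std=1.0):
    # Monte Carlo estimate of KL(q || p) for an implicit q: fit a Gaussian
    # KDE to the samples to approximate log q(w), then average
    # log q(w) - log p(w) over those same samples.
    kde = gaussian_kde(w_samples)
    log_q = kde.logpdf(w_samples)
    log_p = norm.logpdf(w_samples, scale=prior_std)
    return np.mean(log_q - log_p)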

gdikov commented 6 years ago

Hi @pawni,

I would gladly open-source my implementation once I finish my master's thesis and clear it with my supervisor. I will update you on that. I am also curious to see the new version of the Implicit Weight Uncertainty paper. As far as I can see, it is not updated yet, right? I will keep checking.

pawni commented 6 years ago

Cool! It should be out on Monday or Tuesday. But feel free to shoot me an email and I can send you a version there :)

bashhwu commented 6 years ago

@gdikov

[Figure: predictive uncertainty on the same dataset using a Gaussian process]
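For anyone wanting to reproduce a comparable baseline, a minimal GP sketch with scikit-learn (library and kernel choices assumed, not the code behind the figure above):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

x, y = generate_toy()
kernel = RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel).fit(x, y)

x_test = np.linspace(-1.0, 0.75, 500).reshape(-1, 1)
mean, std = gp.predict(x_test, return_std=True)  # std grows in the data gap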