Closed gdikov closed 5 years ago
Hi Georgi,
thanks for pointing that out! I'm aware of the issue but haven't had the chance to update it online yet. Overall, I found that this does not affect the performance of the approach.
I am working on extending the work into a full paper and planned to update the code as soon as this is done. As a preview, I can say that the approach also seems to work reasonably well with one hypernet per layer and a single noise variable. This also scales okayish to ResNets.
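The per-layer setup described above can be sketched roughly like this (a minimal NumPy sketch; all shapes, the hidden size 64, and the function names are hypothetical illustrations, not taken from the actual code):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical shapes: one hypernet per layer, all fed the same noise z.
noise_dim = 8
layer_shapes = [(1, 50), (50, 1)]  # main network: 1 -> 50 -> 1

# Each hypernet is itself a small net: noise -> hidden -> flattened weights.
hypernets = [(rng.randn(noise_dim, 64) * 0.1,
              rng.randn(64, np.prod(s)) * 0.1) for s in layer_shapes]

def sample_weights(z):
    """Map a single noise draw to one full set of main-network weights."""
    weights = []
    for (h1, h2), shape in zip(hypernets, layer_shapes):
        w = np.maximum(z @ h1, 0.0) @ h2   # shared noise, separate hypernet
        weights.append(w.reshape(shape))
    return weights

z = rng.randn(1, noise_dim)
w1, w2 = sample_weights(z)
assert w1.shape == (1, 50) and w2.shape == (50, 1)
```

Each draw of `z` yields a different full set of weights, which is what makes the implicit weight distribution.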
I am happy to accept a PR for this, but I might also put a note in the README that an update will follow. I am happy to discuss specifics though :)
Hi Nick,
I would gladly open a PR, but unfortunately I was not able to reproduce the results in the notebook, even after playing with the network architectures, learning rates, and training schedules. I will take some more time to find a balanced parametrisation for the discriminator and generator and will then open the PR.
Yesterday, however, I implemented my own version of the Bayes By Hypernet (BBH) model, based on your implementation, and I would like to share some interesting findings. I ran it on a toy regression dataset for which my Bayes By Backprop (https://arxiv.org/pdf/1505.05424.pdf) implementation failed to produce nice uncertainty estimates in the interesting regions, whereas BBH did. In order to make it work, though, I was much more modest in the networks' parametrisation than your original scripts.
*(Figures: predictive uncertainty on the toy dataset, Bayes By Backprop vs. Bayes By Hypernet)*
Actually, in most papers on Bayesian neural networks that I have looked at, people show predictive uncertainty on toy datasets whose samples form a single connected region, and all of them look good. I was surprised to discover, however, the inability of BBB to produce increased uncertainty estimates in the gap between the two clusters of samples. Maybe my intuition, which tells me to expect exactly that, is wrong... Any thoughts on that?
Sorry for the off-topic and best regards, Georgi
Hi @gdikov
Many thanks for your comment.
Would you mind sending me the code you used to generate the toy dataset? I will try it with my BBB code and compare to yours.
Hi @bashhwu,
Sure I would:
```python
import numpy as np

rng = np.random.RandomState()

def generate_toy(data_size=1000, noise_std=.02):
    def f(x):
        return (0.3*x + rng.normal(0, noise_std, size=x.shape)
                + 0.3*np.sin(2*np.pi*x + rng.normal(0, noise_std, size=x.shape))
                + 0.3*np.sin(4*np.pi*x + rng.normal(0, noise_std, size=x.shape)))

    data_x = np.concatenate([np.linspace(-0.5, -0.25, data_size // 2),
                             np.linspace(0.0, 0.25, data_size // 2)]).reshape(-1, 1)
    data_y = f(data_x)
    return data_x, data_y
```
Btw, `f` is overly complicated as it was originally taken from Blundell's BBB paper, but it doesn't really matter -- it serves its purpose.
Thanks for the code, @gdikov. I will display the result here once finished.
The figure below shows the predictive uncertainty when the toy dataset (y=x**3) is applied to the BBB model with 50 training data points. Unfortunately, sns.tsplot is unable to display the uncertainty intervals clearly because the variance over the samples drawn from the variational distributions is low. Therefore, I preferred to plot them separately.
Do you think I should blame the model?
Hi @gdikov,
Could you please provide me with the structure of your BBB network so I can make a fair comparison?
Well, I don't know your prior, but I think that for 50 data points (and assuming a large enough number of parameters) you should be getting quite a significant pull from the KL term, and hence your posterior variance shouldn't shrink that much. Maybe something is wrong with the loss formulation?
My BBB network wasn't anything special. Basically anything simple will do the trick for this dataset; e.g. one layer of 50 ReLU units, or two, will be more than enough. Notice that I trained it on 1000 points and the variance is quite small. With only 50 points I would get a much more entropic predictive distribution.
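For reference, the kind of BBB forward pass I mean can be sketched in a few lines (a minimal NumPy sketch with untrained, hypothetical variational parameters; the real model also has biases, a KL term, and a training loop):

```python
import numpy as np

rng = np.random.RandomState(0)

def softplus(x):
    return np.log1p(np.exp(x))

def bbb_forward(x, params, rng):
    """One stochastic forward pass: sample weights, then 1 -> 50 ReLU -> 1."""
    mu1, rho1, mu2, rho2 = params
    w1 = mu1 + softplus(rho1) * rng.randn(*mu1.shape)  # reparameterisation
    w2 = mu2 + softplus(rho2) * rng.randn(*mu2.shape)
    h = np.maximum(x @ w1, 0.0)
    return h @ w2

# Hypothetical variational parameters; rho = -10 gives
# sigma = softplus(-10) ~ 4.5e-5, i.e. a near-deterministic network.
params = (rng.randn(1, 50), np.full((1, 50), -10.0),
          rng.randn(50, 1), np.full((50, 1), -10.0))

x = np.linspace(-0.5, 0.25, 20).reshape(-1, 1)
samples = np.stack([bbb_forward(x, params, rng) for _ in range(100)])
pred_mean, pred_std = samples.mean(axis=0), samples.std(axis=0)

# With tiny sigma the predictive spread collapses; it is the KL pull on the
# rhos during training that should keep visible uncertainty bands open.
assert pred_std.max() < 1e-3
```

The point of the sketch is only that the predictive spread comes entirely from the weight sigmas, so if the sigmas collapse, so do the uncertainty bands.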
Hi @gdikov, the prior used is N(0,1).
After initializing the model parameters with different values, I get this figure:
Regarding your comment about the KL term, I used both the analytical term and the Monte Carlo estimate used by the authors of the original paper. The analytical one does not help.
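For a fully factorised Gaussian posterior and an N(0,1) prior, the two KL variants should agree closely, which makes this easy to sanity-check in isolation (a standalone sketch with made-up mu and sigma, not the actual model's parameters):

```python
import numpy as np

rng = np.random.RandomState(0)
mu, sigma = 0.3, 0.5  # hypothetical variational parameters for one weight

# Analytical KL( N(mu, sigma^2) || N(0, 1) )
kl_analytic = 0.5 * (sigma**2 + mu**2 - 1.0) - np.log(sigma)

# Monte Carlo estimate as in Blundell et al.: E_q[ log q(w) - log p(w) ]
w = mu + sigma * rng.randn(200000)
log_q = -0.5 * np.log(2*np.pi*sigma**2) - (w - mu)**2 / (2*sigma**2)
log_p = -0.5 * np.log(2*np.pi) - w**2 / 2
kl_mc = np.mean(log_q - log_p)

# The two estimates should match to within MC noise.
assert abs(kl_mc - kl_analytic) < 1e-2
```

If the two disagree in the actual loss code, that would point to a sign or scaling bug rather than a modelling problem.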
For the toy dataset you gave above, it seems that my model does not learn the data. The same model works for classification tasks on the MNIST dataset.
Great to see your discussion here. I just updated all the code for an updated Arxiv version that should come online soon.
As for BbB, the initialisation is quite important. @gdikov I would love to see your BbH code and see how you parametrise the network and especially how you frame the KL estimation as I've found the KL to be quite unstable and the new kernel method to work much better.
Hi @pawni,
I would gladly open-source my implementation once I finish my master's thesis and clear it with my supervisor. I will update you on that. I am also curious to see your new version of the Implicit Weight Uncertainty paper. As far as I can see, it is not updated yet, right? I will check it frequently.
Cool! It should be out on Monday or Tuesday. But feel free to shoot me an email and I can send you a version there :)
@gdikov
Using a Gaussian Process:
Hi @pawni,
In the ipython notebook of the toy example you have forgotten to connect the layers of the hypernetwork with each other. The function I am talking about is:
You should change `w_z` to `z` in the lines `z = tf.layers.dense(inputs=w_z, units=256)` and `z = tf.layers.dense(inputs=w_z, units=100)`. I guess it was a copy-paste issue, but it has the unfortunate consequence of specifying the hypernetwork as a linear function of the noise! A similar bug is also present in the MNIST notebook, where the last linear dense layer is connected to the first instead of the previous one.

Best, Georgi
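The consequence of the bug is easy to demonstrate in isolation (a NumPy sketch with hypothetical weight matrices standing in for the `tf.layers.dense` kernels, and a ReLU assumed between layers; not the notebook's actual code):

```python
import numpy as np

rng = np.random.RandomState(0)
noise_dim, hidden, out = 8, 256, 100

# Hypothetical, fixed weights standing in for the learned dense kernels.
w_in  = rng.randn(noise_dim, hidden) * 0.1
w_out = rng.randn(hidden, out) * 0.1
w_bug = rng.randn(noise_dim, out) * 0.1  # what the buggy second layer uses

def hypernet_fixed(w_z):
    z = np.maximum(w_z @ w_in, 0.0)   # hidden layer with ReLU
    return z @ w_out                   # reads the hidden layer

def hypernet_buggy(w_z):
    _ = np.maximum(w_z @ w_in, 0.0)   # computed but never consumed
    return w_z @ w_bug                 # reads the noise again

x, y = rng.randn(1, noise_dim), rng.randn(1, noise_dim)

# The buggy network is linear in the noise; the fixed one is not.
assert np.allclose(hypernet_buggy(x + y), hypernet_buggy(x) + hypernet_buggy(y))
assert not np.allclose(hypernet_fixed(x + y), hypernet_fixed(x) + hypernet_fixed(y))
```

A generator that is linear in its noise can only produce a Gaussian weight distribution, which defeats the purpose of using an implicit distribution in the first place.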