google / edward2

A simple probabilistic programming language.
Apache License 2.0

Variational inference with implicit probabilistic models #473

Open plcrodrigues opened 3 years ago

plcrodrigues commented 3 years ago

Hi, everyone. I'm not sure if I'm in the right place to ask, but I'll try anyway :-)

I've been trying to implement the algorithm proposed in Tran et al. (2017), "Hierarchical Implicit Models and Likelihood-Free Variational Inference" (arXiv), and I admit I'm having quite a bit of trouble getting it done.

I have found a piece of code that implements the implicit KLqp training proposed in the paper, implicit_klqp, but it is written for Edward1, and AFAIK there is no accompanying script showing how to use the function in practice.

Since the Edward1 project seems quite inactive nowadays, I was wondering whether people here could help me find a way to use the algorithm proposed in the paper I mentioned :-)

Cheers, Pedro

dustinvtran commented 3 years ago

Hey @plcrodrigues ! Thanks for reaching out.

Indeed, that implicit_klqp.py is outdated as it's TF1. It's not too difficult to translate it into a new DL framework like TF2 though.
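For example, the core of a ratio estimator carries over almost verbatim. Here is a minimal sketch of what it could look like in TF2/Keras (make_ratio_estimator is just an illustrative helper, not part of any library; the surrounding training loop with tf.GradientTape is omitted):

import tensorflow as tf  # TF2

def make_ratio_estimator(hidden_units=64):
    # Maps concatenated (data, local, global) features to a scalar log-ratio.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(1),
    ])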

If you've taken a stab at trying to build on the code for your use case, I'd be happy to take a look.

plcrodrigues commented 3 years ago

Hi Dustin, thank you very much for your help!

I must say that I am not very familiar with TF2 (nor TF1, for that matter; I mostly use PyTorch for my research), and my intention was really to get a handle on the method you proposed in your very interesting NeurIPS 2017 paper. With that said, I was wondering if I could share with you how my code is going in Edward1 using ed.ImplicitKLqp and have your opinion on whether things look OK.

In fact, my main concern is that I am not sure whether the rather poor results I'm seeing, compared to the method that I propose, come from the natural instability of the GAN-like training in ed.ImplicitKLqp or from somewhere else (e.g. a bad choice of approximators).

Problem formulation

I have N observations y_i, each of the form y_i = w * x_i + e_i. The parameter w is a scalar global variable, and the x_i are scalar local variables that I would like to estimate as well. The e_i are Gaussian perturbations with small standard deviation. The prior distributions for all variables (local and global) are Uniform on [0, 1].

I know, you will want to say that this problem does not really require an LFI approach like the one in Tran et al. 2017, and indeed that is the case. However, this simple toy model serves as a validation step for further work I'm developing where the likelihood-free approach really is necessary (i.e. the data are generated by complex simulators). This is why I am considering the LFVI approach from your paper.

The code (in Edward1)

import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from edward.models import Normal, Uniform

tf.flags.DEFINE_integer("N", 100, "Number of data points.")
FLAGS = tf.flags.FLAGS

def build_toy_dataset(x, w):
    # Simulate y_i = w * x_i + e_i with small Gaussian noise.
    N = len(x)
    y = x * w + np.random.normal(0.00, 0.05, size=N)
    return y

def ratio_estimator(data, local_vars, global_vars):
    # Outputs one real value per data point: the estimated log ratio.
    # Note: `y`, `x`, and `w` refer to the random variables defined below.
    # Concatenate y_n, x_n, and the (tiled) global w, one row per data point.
    inputs = tf.concat([
        tf.reshape(data[y], [FLAGS.N, 1]),
        tf.reshape(local_vars[x], [FLAGS.N, 1]),
        tf.tile(tf.reshape(global_vars[w], [1, 1]), [FLAGS.N, 1])], 1)
    hidden = tf.layers.dense(inputs, 64, activation=tf.nn.relu)
    output = tf.layers.dense(hidden, 1, activation=None)
    return output

ed.set_seed(42)

# Create data
x_true = np.r_[0.5, np.random.rand(FLAGS.N - 1)]
w_true = np.array([0.5])
y_train = build_toy_dataset(x_true, w_true)

# Create the model
x = Uniform(low=tf.zeros(FLAGS.N), high=tf.ones(FLAGS.N))
w = Uniform(low=tf.zeros(1), high=tf.ones(1))
y = Normal(loc=x*w, scale=0.05*tf.ones(FLAGS.N))

# Instantiate the approximations
qx = Normal(loc=tf.get_variable("qx/loc", [FLAGS.N]),
            scale=tf.nn.softplus(tf.get_variable("qx/scale", [FLAGS.N])))
qw = Normal(loc=tf.get_variable("qw/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [1])))

# Run both inference trainings and get samples
sess = ed.get_session()

# Get samples from prior
n_samples = 10_000
samples_w_prior = w.sample(n_samples).eval()
samples_x0_prior = x.sample(n_samples)[:,:1].eval()

# Instantiate the inference object
inference = ed.ImplicitKLqp({w: qw, x: qx}, data={y: y_train}, 
                            discriminator=ratio_estimator, 
                            global_vars={w: qw})

# 1. Initialize algorithm via `initialize`.
inference.initialize(n_iter=5000)
tf.global_variables_initializer().run()

# 2. Run `update` for `inference.n_iter` iterations
for _ in range(inference.n_iter):
    for _ in range(5):
        info_dict_d = inference.update(variables="Disc")            
    info_dict = inference.update(variables="Gen")
    info_dict['loss_d'] = info_dict_d['loss_d']
    info_dict['t'] = info_dict['t'] // 6  # treat each set of 6 updates as 1 iteration
    inference.print_progress(info_dict)

# 3. Finalize algorithm via `finalize`
inference.finalize()    

# Get samples from posterior estimated via LFVI
n_samples = 10_000
samples_w_posterior = qw.sample(n_samples).eval()
samples_x0_posterior = qx.sample(n_samples)[:,:1].eval()

Results

The figure below shows the histograms of the samples from the posterior approximations of the global parameter w and of the local parameter x0. I'm considering N = 100. Please note that the results also show very large variability: every time I run the training procedure with a different seed, I get something quite different.

[Figure 1: histograms of the approximate posterior samples for w and x0]

Questions

1) Does the code look OK in terms of training the inference procedure and setting up everything else (ratio estimator, etc.)?
2) Should I really be expecting such large variability in the results with the LFVI procedure?

Thank you very much! Pedro cc @agramfort @tommoral @

dustinvtran commented 3 years ago

Thanks for sharing! If you're converting a version for your own likelihood-free experiments in PyTorch or TF2, I'd be curious to take a look.

  1. Indeed, that code looks correct to me. If you want to test your specific hypothesis that the ratio estimator is unstable, you can actually compute the ground-truth ratio estimator and see how far off the learned one is across training. That's the advantage of starting with a toy model like linear regression. You'll probably need to tune the total number of training steps, the number of discriminator steps in the inner loop, the underlying learning rate in Adam, and maybe even the architecture of the ratio estimator.

  2. If the variability changes so much across seeds, my guess is that the models haven't finished training. For debugging, one thing you can do is update the ratio estimator for a ton of steps before each outer-loop update (see the sketch after this list). This ensures that the loss function for LFVI is pretty much that of VI (subject to the ratio estimator being slightly off).
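Concretely, that debugging variant of the training loop could look roughly like this (a sketch reusing the `inference` object and `update` calls from the code above; `n_disc_steps` is just an illustrative knob to tune):

n_disc_steps = 100  # "a ton of steps"; tune as needed

for _ in range(inference.n_iter):
    # Warm up the ratio estimator so it is near-optimal for the current q.
    for _ in range(n_disc_steps):
        info_dict_d = inference.update(variables="Disc")
    # Then take a single variational ("generator") step.
    info_dict = inference.update(variables="Gen")
    info_dict['loss_d'] = info_dict_d['loss_d']
    inference.print_progress(info_dict)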

plcrodrigues commented 3 years ago

Hi @dustinvtran, thank you very much for your answers. I debugged things a little over the past few days and noticed that the training indeed required more iterations in the inner loop (50 instead of 5) to give steadier results. I now have something that is quite stable across different seeds and whose training loss is stable as well. However, I still have some questions :-)

Question 1: are local parameters indeed hard to grasp?

From what I can see now, the global parameter seems to be rather well estimated as more and more observations become available (as expected). However, the local parameter is really off at all times. The curves below portray the Wasserstein distance of the marginal posterior distributions for w and for x0 with respect to Dirac distributions at the ground-truth value of each parameter (this measures both how sharp the posteriors are and how far they sit from the ground truth). I am plotting the median distances over different choices of ground truth; the distance itself is computed as in the sketch after the figure.

[Figure 2: median Wasserstein distances to the ground truth for w and x0]
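For reference, this is essentially the computation (a sketch using scipy; the W1 distance between an empirical sample and a Dirac at the true value reduces to the mean absolute deviation):

import numpy as np
from scipy.stats import wasserstein_distance

def distance_to_ground_truth(samples, true_value):
    # W1 distance between the empirical posterior and a Dirac at the truth.
    # For a point mass this is just E|theta - theta_true|, so both lines agree.
    d_scipy = wasserstein_distance(samples, [true_value])
    d_manual = np.mean(np.abs(samples - true_value))
    assert np.isclose(d_scipy, d_manual)
    return d_scipy

# e.g. distance_to_ground_truth(samples_w_posterior.ravel(), 0.5)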

Would you have any thoughts on this result?

If I am not mistaken, the original paper of Tran et al. 2017 (https://arxiv.org/abs/1702.08896) has no examples in which local parameters are estimated. Moreover, the example on which I am basing my code, bayesian_linear_regression_implicitklqp.py, also considers only global variables. With that said, I am not sure whether the results I am getting are expected. Maybe local parameters are just hard to grasp?

Question 2: how to debug the ratio estimator?

I think I understood what you meant by checking how the ratio estimator is doing after training. I guess this has to do with Appendix F of the Tran et al. 2017 paper, right? However, I am not sure how to do this with Edward... :-/

I do have the expression for the log-likelihood, log p(y_n | x_n, w) = -(y_n - x_n * w)^2 / (2 * sigma^2) + const, but I am not sure which information from the inference procedure I should track to assess whether the ratio estimator is expressive enough.
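In case it helps the discussion, here is the kind of check I have in mind (a sketch; it assumes ed.ImplicitKLqp builds the discriminator under the variable scope "Disc", which is what the variables="Disc" filter above suggests, so the reuse trick may need adjusting):

sigma = 0.05
x_sample = qx.sample().eval()
w_sample = qw.sample().eval()

# Analytic log p(y_n | x_n, w) for the linear-Gaussian toy model.
log_lik = (-0.5 * (y_train - x_sample * w_sample) ** 2 / sigma ** 2
           - np.log(sigma * np.sqrt(2.0 * np.pi)))

# ASSUMPTION: the trained discriminator variables live under scope "Disc".
with tf.variable_scope("Disc", reuse=True):
    r = ratio_estimator(
        {y: tf.constant(y_train, dtype=tf.float32)},
        {x: tf.constant(x_sample, dtype=tf.float32)},
        {w: tf.constant(w_sample, dtype=tf.float32)})

r_val = r.eval().ravel()
# If the ratio estimator is expressive and well trained, r_val should track
# log_lik up to the log p(x_n) - log q(x_n) correction terms.
print(np.mean(np.abs(r_val - log_lik)))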