Abstraction of empirical distribution.

botev commented 6 years ago

I know this might be a bit strange, but I think it would be useful to have at least the option of an empirical distribution. This would, for instance, make specifying the "inference" networks more symmetric and show more clearly what the full graphical model is. It would also make it consistent to pass x as an observed variable when we query for the log-probabilities rather than when we build the model. Additionally, the forward and backward models would then be built with exactly the same signatures. Taking the example:

@zs.reuse('model')
def vae(observed, x_dim, z_dim, n_x, n_z_per_x):
    with zs.BayesianNet(observed=observed) as model:
        z_mean = tf.zeros([n_x, z_dim])
        z = zs.Normal('z', z_mean, std=1., group_ndims=1, n_samples=n_z_per_x)
        lx_z = layers.fully_connected(z, 500)
        lx_z = layers.fully_connected(lx_z, 500)
        x_logits = layers.fully_connected(lx_z, x_dim, activation_fn=None)
        x = zs.Bernoulli('x', x_logits, group_ndims=1)
    return model, x_logits

@zs.reuse('variational')
def q_net(observed, x_dim, z_dim, n_x, n_z_per_x):
    with zs.BayesianNet(observed=observed) as variational:
        x = zs.Empirical('x', (n_x, x_dim), dtype=tf.int32)
        lz_x = layers.fully_connected(tf.to_float(x), 500)
        lz_x = layers.fully_connected(lz_x, 500)
        z_mean = layers.fully_connected(lz_x, z_dim, activation_fn=None)
        z_logstd = layers.fully_connected(lz_x, z_dim, activation_fn=None)
        z = zs.Normal('z', z_mean, logstd=z_logstd, group_ndims=1,
                      n_samples=n_z_per_x)
    return variational

botev commented 6 years ago

A working example of this can be seen here. If you guys think this is reasonable I can make a PR for it as well.

thjashin commented 6 years ago

Nice idea, and I understand the motivation. What I am wondering is whether this really improves ease of use, since the only difference in programming the posterior is that you need to pass 'x' through the observed argument instead of a separate one.

Despite this I'm still happy to merge it as an optional feature for people who love the symmetry :) Before that let me check how you deal with the shape things.

botev commented 6 years ago

Yes, I'm not 100% sure the shapes are correct, so having a look is fine.

Another use case, which I came across just now, is to have a Delta distribution (or a marker). This makes it possible to pull some of the deterministic computation from inside the model. As an example, some of the other normalizing flows (Householder, for instance) usually build the flow from the "top deterministic" layer before the z_logits. This can additionally make GANs a lot more natural, since the sampling epsilon and the final variable are both inside the BayesianNet and both retrievable. As an example (I specifically use it now for the VAE):

@zs.reuse('variational')
def q_net(observed, x_dim, z_dim, n_x, n_z_per_x):
    with zs.BayesianNet(observed=observed) as variational:
        x = zs.Empirical('x', (n_x, x_dim), dtype=tf.int32)
        lz_x = layers.fully_connected(tf.to_float(x), 500)
        lz_x = layers.fully_connected(lz_x, 500)
        h = zs.Delta("h", lz_x)
        z_mean = layers.fully_connected(lz_x, z_dim, activation_fn=None)
        z_logstd = layers.fully_connected(lz_x, z_dim, activation_fn=None)
        z = zs.Normal('z', z_mean, logstd=z_logstd, group_ndims=1,
                      n_samples=n_z_per_x)
    return variational

This now allows me to do:

(z, log_q_z), (h, _) = q_net({"x": x}, x_dim, z_dim, n_x, n_z_per_x) \
    .query(["z", "h"], outputs=True, local_log_prob=True)

Using this a GAN would look like:

@zs.reuse('model')
def gan(observed, x_dim, z_dim, n_x, n_z_per_x):
    with zs.BayesianNet(observed=observed) as model:
        z_mean = tf.zeros([n_x, z_dim])
        z = zs.Normal('z', z_mean, std=1., group_ndims=1, n_samples=n_z_per_x)
        lx_z = layers.fully_connected(z, 500)
        lx_z = layers.fully_connected(lx_z, 500)
        lx_z = layers.fully_connected(lx_z, x_dim, activation_fn=None)
        x = zs.Delta("x", lx_z)
    return model

Note that for continuous variables the log likelihood of the Delta is considered infinite.
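
The 0/inf convention in the note above can be sketched in plain Python. `delta_log_prob` is a hypothetical helper written only for illustration, not part of ZhuSuan; it just shows which values a Delta node would report under that convention.

```python
import math


def delta_log_prob(value, given, continuous=True):
    """Log-density (or log-mass) of a point mass at `value`, evaluated at `given`.

    Hypothetical helper for illustration only; not a ZhuSuan function.
    """
    if given != value:
        # No mass away from the point: probability 0, so log-probability -inf.
        return -math.inf
    # At the point itself the density is infinite for a continuous variable,
    # while the mass is exactly 1 (log 1 = 0) for a discrete one.
    return math.inf if continuous else 0.0
```

For a discrete variable this gives the 0/-inf log-mass, and for a continuous one the inf/-inf log-density discussed here.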

thjashin commented 6 years ago

I agree this can be an alternative if you want to query some deterministic things through the context. Maybe Deterministic or Implicit is better? Since it may not be a delta due to randomness in upstream nodes.

thjashin commented 6 years ago

@botev Btw, if you'd like to make the Implicit node, we would be happy to bring forward the plan to support density ratio estimation for implicit distributions (e.g. through a GAN-like discriminator). This will make learning of implicit models easier in ZhuSuan. @ssydasheng is happy to help with this.

botev commented 6 years ago

Sure, I don't mind what the name is.

Since it may not be a delta due to randomness in upstream nodes.

My interpretation of the name "delta" was as a conditional delta, in the same way that when we write x = zs.Normal we mean a conditional normal in the graphical model, not a marginal one.

On that point, are you comfortable with the 0/1 0/inf probability densities for the Implicit?

thjashin commented 6 years ago

How about a NotImplementedError? Users may not expect an inf in their computation. They would then be reminded whenever they try to use the density of an implicit distribution, which is never a good choice.

thjashin commented 6 years ago

Yeah you're right. I mixed up the conditional and marginal. In that sense I agree that Delta is ok. But I think Implicit is still preferred considering it's widely used in GAN related papers.

ssydasheng commented 6 years ago

I think that depends on whether x is generated from a random sample like in GAN, if it is, then Implicit seems better. If x is just a fixed tensor, then Delta seems good.

botev commented 6 years ago

@ssydasheng I think generally we are talking about things which are fixed given their inputs, but which depend on something stochastic. E.g. each layer of the GAN is Implicit/Delta. I don't think two distributions are needed.

@thjashin I don't think an error is a good idea, since at the moment, when you query a model for two variables and want the log_prob of only one of them, it is returned for both. E.g.:

(z, log_q_z), (h, _) = q_net({"x": x}, x_dim, z_dim, n_x, n_z_per_x) \
    .query(["z", "h"], outputs=True, local_log_prob=True)

This would raise an error, which is not what you want. You could issue a warning instead, or alternatively return None.

thjashin commented 6 years ago

That makes sense. None seems to be a good choice.

botev commented 6 years ago

Hmm, apparently the None gives an error from the base method:

    @add_name_scope
    def log_prob(self, given):
        """
        log_prob(given)

        Compute log probability density (mass) function at `given` value.

        :param given: A Tensor. The value at which to evaluate log probability
            density (mass) function. Must be able to broadcast to have a shape
            of ``(... + )batch_shape + value_shape``.
        :return: A Tensor of shape ``(... + )batch_shape[:-group_ndims]``.
        """
        given = self._check_input_shape(given)
        log_p = self._log_prob(given)
        return tf.reduce_sum(log_p, tf.range(-self._group_ndims, 0))

This is because reduce_sum is called on the None. I can either modify the base method as well or go back to the infinite log-probabilities.
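
The failure mode can be reproduced without TensorFlow. In this sketch `reduce_sum` is a plain-Python stand-in for `tf.reduce_sum`, and `log_prob_allowing_none` is a hypothetical variant of the base method that passes a missing density through; neither name comes from ZhuSuan.

```python
def reduce_sum(values):
    # Stand-in for tf.reduce_sum over the grouped dimensions.
    return sum(values)


def log_prob(raw_log_p):
    # Mirrors the base method: always reduces whatever _log_prob returned.
    return reduce_sum(raw_log_p)


def log_prob_allowing_none(raw_log_p):
    # Hypothetical modification: skip the reduction when no density exists.
    if raw_log_p is None:
        return None
    return reduce_sum(raw_log_p)


try:
    log_prob(None)
    failed = False
except TypeError:
    # sum(None) raises TypeError here; TensorFlow reports its own error type.
    failed = True
```

Modifying the base method this way means every caller of `log_prob` has to handle None, which is the cost to weigh against returning an infinite value.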

thjashin commented 6 years ago

I'm not sure which is better though. @ssydasheng @cjf00000 @korepwx @miskcoo Which type of log_prob do you prefer for implicit/delta distributions? None or inf?

ssydasheng commented 6 years ago

I prefer inf with a warning

miskcoo commented 6 years ago

0/inf seems good.

botev commented 6 years ago

Ok, I will implement that and make a PR. A similar issue, which would be nice to also solve, is to have a Reparametrizable distribution. This will allow anything that is technically a Normalizing Flow to be part of the model as well. I would suggest the interface for that to be something like:

z = zs.Normal("z", ...)
f_z, log_det = func(z)
z_t = zs.Reparametrizable("z_t", f_z, log_det)

It will not allow passing num_samples. This way, I think, it will work correctly out of the box, if I understand correctly how you use the models to bootstrap them.

thjashin commented 6 years ago

Cool, thanks.

As for Reparameterizable, we discussed this previously but finally decided not to support it (at least not as a first priority). The main reason is that, to implement this, passing only func and log_det is not enough. You have to build a bijector, which can apply the inverse func^{-1} to evaluate the density at a given value. We felt that all these arguments (bijector, log_det) made the feature useless, because users are required to provide everything and the library only wraps the basic computation in a function. That's why we finally provided a simple implementation of normalizing flow.
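
A minimal sketch of why the inverse is needed, using a one-dimensional affine flow over a standard normal. All names and the parameters `A` and `B` are illustrative assumptions, not ZhuSuan API. The sampling path knows the base sample z and needs only the forward map and its log-determinant, while scoring an arbitrary value y has to recover z through the inverse first.

```python
import math
import random


def normal_log_pdf(z):
    # Log-density of a standard normal at z.
    return -0.5 * (z * z + math.log(2.0 * math.pi))


A, B = 2.0, 1.0  # Illustrative parameters of the flow y = A * z + B.


def forward(z):
    # Returns the transformed sample and log|det J| of the forward map.
    return A * z + B, math.log(abs(A))


def inverse(y):
    # Needed only to score a value that was not produced by sampling.
    return (y - B) / A


def sample_and_log_prob(rng):
    # Sampling path: the base sample z is known, so no inverse is required.
    z = rng.gauss(0.0, 1.0)
    y, log_det = forward(z)
    return y, normal_log_pdf(z) - log_det


def log_prob(y):
    # Observed path: z must be recovered through the inverse first.
    z = inverse(y)
    _, log_det = forward(z)
    return normal_log_pdf(z) - log_det
```

For a flow such as the planar one, `inverse` has no closed form, so only the sampling path is available.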

botev commented 6 years ago

So I do agree that technically you need a bijective function. However, at least for now, since it is not a priority, you could restrict Reparameterizable so that it cannot be part of the observed variables, or so that, if it is observed, you cannot query the "root" latent or the log-probability of either of them.

That might not be too easy, I agree. Let me have a think about it, and if I come up with a nicer way of doing this I'll make an example and a proposal.

thjashin commented 6 years ago

Yep. Some insights are really needed on this feature.

botev commented 6 years ago

Ok, so I think the main issue is that in most NFs we return the "samples" and the "log_det" simultaneously, which is pretty much the only way to compute things efficiently. This might be a breaking change, but it is worth considering: add a method sample_and_log_prob to the base Distribution class, which by default calls sample and then log_prob. When users call query, you will now have to check, for each variable, whether they request both, and if so call this method. This would keep all existing code backwards compatible. It would allow creating a new distribution which supports querying the log_prob only through that method. This would also not require an inverse model; that can be added later, once you have both a forward and an inverse model.

Another option is for the model to have, similar to self._tensor, a self._local_log_prob which facilitates this in a similar fashion. This, in fact, might be easier.

botev commented 6 years ago

So with the second suggestion the normalizing flow example looks like this:

def q_net(x, z_dim, n_particles, n_planar_flows):
    with zs.BayesianNet() as variational:
        lz_x = tf.layers.dense(tf.to_float(x), 500, activation=tf.nn.relu)
        lz_x = tf.layers.dense(lz_x, 500, activation=tf.nn.relu)
        z_mean = tf.layers.dense(lz_x, z_dim)
        z_logstd = tf.layers.dense(lz_x, z_dim)

        def flow(samples, log_samples):
            return zs.planar_normalizing_flow(samples, log_samples,
                                              n_iters=n_planar_flows)
        z = zs.NormalFlow('z', flow,
                          z_mean, logstd=z_logstd, group_ndims=1, n_samples=n_particles)
    return variational

All of the changes that were required can be viewed here: https://github.com/botev/zhusuan/commit/4873d6e990e93ca2bc14d6626528de41e17623aa

PS: I would also suggest that the NF return only the log_det, so that you don't pass in the base log-probability. There are cases where you just want to use the functional form of the flow, and if we have an inverse it can't take log_prob as input.

thjashin commented 6 years ago

I think the key point here is that you suggest making Normalizing Flow a specific distribution. So in that way an error can be raised when its log_prob is called. This is good to have. But I feel it is better to have sample_and_log_prob only implemented in the flow distribution, because in the current implementation you construct log_prob related graphs in situations where users may only want tensor.

botev commented 6 years ago

Doesn't TensorFlow skip those graphs during the computation? If the user does not need them, they won't be evaluated. Also, note that this can easily be side-stepped by making it a closure internally.

Another option is as you mentioned to have this only for Flow Distribution and have a specific case in the query method.

botev commented 6 years ago

Ok, so I think your suggestion is good. I've also implemented this here: https://github.com/botev/zhusuan/commit/94e43717ddafc36820657377848d9fb5da7a0357. I think this approach is good as it addresses both. I've also added code for an optional inverse model, which allows calculating log_prob if ever needed. Note that I don't think there is any way of not creating the log_det graph for the FlowDistribution (and there are not many use cases where you would want that either).

However, we do need to set in stone the interface to the forward and inverse models, and I think it is better for them to return just the log_det rather than the sum log_x0 - log_det.

thjashin commented 6 years ago

Well, actually introducing an inverse model is not necessary for normalizing flow (e.g., for the planar flow the inverse is not available in closed form). So I suggest we leave it for later as part of the TransformedDistribution work, for which it is much harder to design a good API. For the normalizing flow distribution, another thing is about shapes. Note that NF should only be applied to distributions of value_shape [], and it is up to the user how many dimensions of the batch_shape they consider as a group. So instead of applying it to the last dimension of batch_shape, we should take group_ndims into consideration.

Note that I don't think there is any way of not creating the log_det graph for the FlowDistribution (also not many use cases when that is the case as well).

I mean you do this for all distributions because the tensor property is implemented by _tensor_and_log_prob.

Maybe we could leave sample_and_log_prob as a method in the base Distribution, implemented by default by directly passing the samples to log_prob, and overridden by the FlowDistribution.

botev commented 6 years ago

So in the second implementation variant here: https://github.com/botev/zhusuan/commit/94e43717ddafc36820657377848d9fb5da7a0357 there are a few things:

1. I create _local_log_prob when .tensor is called only for the FlowDistribution. For any other distribution, it is created only if you explicitly request .local_log_prob; otherwise, the graphs are not constructed.
2. The inverse model is left optional (e.g. None by default) for the FlowDistribution. If you call log_prob and it is None, it raises an exception; otherwise it calculates the probability accordingly.
3. I've left the sample_and_log_prob to exist only in FlowDistribution so that 1. is achievable.

As for the shapes, could you maybe give me an example, because I'm not sure I understand the issue?

thjashin commented 6 years ago

I spent some time thinking about this and have an improved version based on the second implementation.

For base distribution

def sample_and_log_prob(self):
    samples = self.sample()
    log_p = self.log_prob(samples)
    return samples, log_p

By default it will call sample and then log_prob

For FlowDistribution,

def sample_and_log_prob(self):
    samples, log_p = self.base_dist.sample_and_log_prob()
    samples, log_p = planar_normalizing_flow(samples, log_p, self.n_flows)
    return samples, log_p

def _sample(self):
    # Maybe a specialized error is better
    raise NotImplementedError()

def _log_prob(self):
    raise NotImplementedError()

It is overridden to use the forward function.

Then in the base StochasticTensor.

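@property
def tensor(self):
    # Cache in private attributes: a read-only property cannot assign to
    # self.tensor, and the getter has to return the value.
    if not hasattr(self, '_tensor'):
        try:
            self._tensor = self._distribution.sample()
        except NotImplementedError:
            self._tensor, self._local_log_prob = \
                self._distribution.sample_and_log_prob()
    return self._tensor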

How do you like this? This will remove code about a specific flow distribution in the base classes.
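
The proposal can be exercised end to end with toy stand-ins. Everything in this sketch is a simplified assumption rather than ZhuSuan code: `Uniform01` and `ShiftFlow` are hypothetical classes, values are Python floats instead of tensors, and the flow is a constant shift whose log-determinant is 0.

```python
import random


class Distribution:
    def sample(self):
        return self._sample()

    def log_prob(self, given):
        return self._log_prob(given)

    def sample_and_log_prob(self):
        # Default: draw a sample, then score it.
        samples = self.sample()
        return samples, self.log_prob(samples)


class Uniform01(Distribution):
    # Toy base distribution on [0, 1) with log-density 0 inside the support.
    def _sample(self):
        return random.random()

    def _log_prob(self, given):
        return 0.0


class ShiftFlow(Distribution):
    # Toy flow: adds a constant, so log|det J| = 0.
    def __init__(self, base_dist, shift):
        self.base_dist = base_dist
        self.shift = shift

    def sample_and_log_prob(self):
        samples, log_p = self.base_dist.sample_and_log_prob()
        return samples + self.shift, log_p - 0.0

    def _sample(self):
        raise NotImplementedError()

    def _log_prob(self, given):
        raise NotImplementedError()


class StochasticTensor:
    def __init__(self, distribution):
        self._distribution = distribution
        self.local_log_prob = None

    @property
    def tensor(self):
        if not hasattr(self, "_tensor"):
            try:
                self._tensor = self._distribution.sample()
            except NotImplementedError:
                # Flow-like distributions only expose the joint method.
                self._tensor, self.local_log_prob = (
                    self._distribution.sample_and_log_prob())
        return self._tensor
```

With this shape, the base classes never mention the flow distribution by name; the fallback is triggered purely by the NotImplementedError.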

botev commented 6 years ago

Yes, that does sound good to me and I was thinking as well to add exception handling rather than a check. One thing, however, I do really suggest that the flow has the interface:

samples, log_det_j = flow(samples, **kwargs)

And so then, following your suggestion, in the FlowDistribution:

def sample_and_log_prob(self):
    samples, log_p = self.base_dist.sample_and_log_prob()
    samples, log_det_j = planar_normalizing_flow(samples, self.n_flows)
    return samples, log_p - log_det_j

The reason is that if we add an inverse model where we observe z_t, there is no log_p to pass in. Other than that, if you are also happy with this, I can modify my implementation and make another PR.

thjashin commented 6 years ago

Yep. It will be more consistent if all things in the transform module could have (samples, log_det) returned.
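
A small sketch of what the (samples, log_det) convention buys when flows are chained. The two scalar flows and `apply_flows` are hypothetical examples, not functions from the ZhuSuan transform module; the point is only that each flow stays unaware of the base density and the caller does the bookkeeping.

```python
import math


def scale_flow(samples, factor):
    # y = factor * x, so log|det J| = log|factor|.
    return samples * factor, math.log(abs(factor))


def exp_flow(samples):
    # y = exp(x), so log|det J| = log(exp(x)) = x.
    return math.exp(samples), samples


def apply_flows(samples, base_log_p, flows):
    # Each flow returns (samples, log_det_j); the caller owns the density.
    log_p = base_log_p
    for flow in flows:
        samples, log_det_j = flow(samples)
        log_p -= log_det_j
    return samples, log_p
```

Because no flow takes a log-probability as input, the same functions could also be reused where no base log_p is available.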

thu-ml / zhusuan

Abstraction of empirical distribution. #70