Closed botev closed 6 years ago
A working example of this can be seen here. If you guys think this is reasonable I can make a PR for it as well.
Nice idea, and I understand the motivation. What I am wondering is whether this really maximizes ease of use, since the only difference when programming the posterior is that you may need to pass 'x' through the observed argument instead of a separate one.
Despite this I'm still happy to merge it as an optional feature for people who love the symmetry :) Before that let me check how you deal with the shape things.
Yes, the shapes I'm not 100% sure are correct so having a look is fine.
Another use case, which I came across just now, is to have a Delta distribution (or a marker). This makes it possible to pull some of the deterministic computation out from inside the model. As an example, some of the other Normalizing Flows (Householder for instance) create the flow by using the "top deterministic" layer before the z_logits. This can additionally make GANs a lot more natural, where the sampling epsilon and the final variable are inside the BayesianNet and are both retrievable. As an example (I specifically use it now for a VAE):
import tensorflow as tf
from tensorflow.contrib import layers
import zhusuan as zs

@zs.reuse('variational')
def q_net(observed, x_dim, z_dim, n_x, n_z_per_x):
    with zs.BayesianNet(observed=observed) as variational:
        # zs.Empirical and zs.Delta are the nodes proposed in this thread.
        x = zs.Empirical('x', (n_x, x_dim), dtype=tf.int32)
        lz_x = layers.fully_connected(tf.to_float(x), 500)
        lz_x = layers.fully_connected(lz_x, 500)
        # Expose the top deterministic layer as a queryable node.
        h = zs.Delta("h", lz_x)
        z_mean = layers.fully_connected(lz_x, z_dim, activation_fn=None)
        z_logstd = layers.fully_connected(lz_x, z_dim, activation_fn=None)
        z = zs.Normal('z', z_mean, logstd=z_logstd, group_ndims=1,
                      n_samples=n_z_per_x)
    return variational
This now allows me to do:
(z, log_q_z), (h, _) = q_net({"x": x}, x_dim, z_dim, n_x, n_z_per_x) \
    .query(["z", "h"], outputs=True, local_log_prob=True)
Using this a GAN would look like:
@zs.reuse('model')
def gan(observed, x_dim, z_dim, n_x, n_z_per_x):
    with zs.BayesianNet(observed=observed) as model:
        z_mean = tf.zeros([n_x, z_dim])
        z = zs.Normal('z', z_mean, std=1., group_ndims=1, n_samples=n_z_per_x)
        lx_z = layers.fully_connected(z, 500)
        lx_z = layers.fully_connected(lx_z, 500)
        lx_z = layers.fully_connected(lx_z, x_dim, activation_fn=None)
        x = zs.Delta("x", lx_z)
    return model
Note that for continuous variables the log likelihood of the Delta is considered infinite.
I agree this can be an alternative if you want to query some deterministic things through the context. Maybe Deterministic or Implicit is better, since it may not be a delta due to randomness in upstream nodes?
@botev Btw, if you'd like to make the Implicit node, we may be happy to bring forward the plan for supporting density ratio estimation for implicit distributions (e.g. through a GAN-like discriminator). This will make learning implicit models easier in ZhuSuan. @ssydasheng is happy to help with this.
Sure, I don't mind what the name is.
Since it may not be a delta due to randomness in upstream nodes.
My interpretation of the name "delta" was as a conditional delta, in the same way that when we write x = zs.Normal we mean a conditional normal in the graphical model, not a marginal.
On that point, are you comfortable with the 0/1, 0/inf probability densities for the Implicit?
How about a NotImplementedError? Users may not expect an inf in their computation. They would be reminded of this if they try to use the density of an implicit distribution, which is never a good choice.
Yeah, you're right. I mixed up the conditional and the marginal. In that sense I agree that Delta is ok. But I think Implicit is still preferred, considering it's widely used in GAN-related papers.
I think that depends on whether x is generated from a random sample, as in a GAN. If it is, then Implicit seems better. If x is just a fixed tensor, then Delta seems good.
@ssydasheng I think generally we are talking about things which are fixed, but which depend stochastically on something. E.g. each layer of the GAN is Implicit/Delta. I don't think 2 distributions are needed.
@thjashin I don't think an Error is a good idea, since at the moment when you query a model with 2 variables and want the log_prob of one of them, it is returned for both. E.g.:
(z, log_q_z), (h, _) = q_net({"x": x}, x_dim, z_dim, n_x, n_z_per_x) \
    .query(["z", "h"], outputs=True, local_log_prob=True)
This will raise an error, which you don't want. You can issue a warning or alternatively return None.
That makes sense. None seems to be a good choice.
Hmm, apparently the None gives an error from the base method:
@add_name_scope
def log_prob(self, given):
    """
    log_prob(given)

    Compute log probability density (mass) function at `given` value.

    :param given: A Tensor. The value at which to evaluate log probability
        density (mass) function. Must be able to broadcast to have a shape
        of ``(... + )batch_shape + value_shape``.
    :return: A Tensor of shape ``(... + )batch_shape[:-group_ndims]``.
    """
    given = self._check_input_shape(given)
    log_p = self._log_prob(given)
    return tf.reduce_sum(log_p, tf.range(-self._group_ndims, 0))
This is because reduce_sum is called on the None. I can either modify the base method as well or go back to the infinite log probabilities.
I'm not sure which is better though. @ssydasheng @cjf00000 @korepwx @miskcoo Which type of log_prob do you prefer for implicit/delta distributions? None or inf?
I prefer inf with a warning
0/inf seems good.
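For illustration, the "inf with a warning" behaviour could look roughly like the sketch below. It is plain Python rather than TensorFlow, and the `Delta` class, its `continuous` flag and the warning text are all hypothetical stand-ins, not ZhuSuan's API.

```python
import math
import warnings


class Delta:
    """Hypothetical stand-in for the proposed Delta/Implicit node."""

    def __init__(self, value, continuous=True):
        self.value = list(value)
        self.continuous = continuous

    def log_prob(self, given):
        # A point mass puts all probability on `value`, so any mismatch
        # has log probability -inf.
        if list(given) != self.value:
            return -math.inf
        if self.continuous:
            # The density at the point itself is infinite; warn instead of
            # raising so that querying several nodes at once still works.
            warnings.warn("log_prob of a continuous Delta is infinite")
            return math.inf
        # Discrete case: probability 1 at the point, log(1) = 0.
        return 0.0
```

Returning a float in every branch is the point here: anything downstream that reduces or sums the result keeps working, which is not the case for None.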
Ok, I will implement that and make a PR. A similar issue, which would be nice to also solve, is to have a Reparametrizable distribution. This will allow anything that is technically a Normalizing Flow to be part of the model as well. I would suggest the interface for that to be something like:
z = zs.Normal("z", ...)
f_z, log_det = func(z)
z_t = zs.Reparametrizable("z_t", f_z, log_det)
It will not allow passing num_samples. This way, I think it will work correctly out of the box, if I understand correctly how you use the models to bootstrap them.
Cool, thanks.
As for Reparameterizable, we have discussed this previously but finally decided not to support it (at least not as a first priority). The main reason is that, to implement it, passing only func and log_det is not enough. You have to build a bijector, which can apply the inverse func^{-1} to evaluate the density at a given value. We feel all these arguments (bijector, log_det) make the feature useless, because users are required to provide everything and the library only wraps the basic computation in a function. That's why we finally provided a simple implementation of normalizing flow.
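To make the inverse requirement concrete, here is a minimal change-of-variables sketch in plain Python. The `AffineBijector` class and the helper names are hypothetical and chosen only because an affine map has a closed-form inverse; nothing here is ZhuSuan code.

```python
import math


def normal_log_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


class AffineBijector:
    """Hypothetical bijector y = scale * x + shift (illustration only)."""

    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift

    def forward(self, x):
        # Returns the transformed value and log|dy/dx|.
        return self.scale * x + self.shift, math.log(abs(self.scale))

    def inverse(self, y):
        return (y - self.shift) / self.scale


def transformed_log_prob(y, bijector):
    # Evaluating the density at a *given* y needs the inverse map:
    # log p_Y(y) = log p_X(f^{-1}(y)) - log|det J_f(f^{-1}(y))|.
    x = bijector.inverse(y)
    _, log_det = bijector.forward(x)
    return normal_log_pdf(x) - log_det
```

With a standard normal base, `transformed_log_prob(y, AffineBijector(2.0, 0.5))` agrees with `normal_log_pdf(y, mean=0.5, std=2.0)`, which is a quick sanity check on the sign of the log-determinant term.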
So I do agree that technically you need a bijective function. However, at least for now, as it is not a priority, you could restrict the Reparameterizable so that it cannot be part of the observed variables, or maybe so that, if it is observed, you cannot query for the "root" latent or the log-probability of any of those.
That might not be too easy, I agree. Let me have a think about it, and if I come up with a nicer way of doing this I'll make an example and a proposal.
Yep. Some insights are really needed on this feature.
Ok, so I think the main issue is that in most NFs we return the "samples" and the "log_det" simultaneously; that is pretty much the only way to compute things efficiently. This might be a breaking change and is worth considering, however: add a method sample_and_log_prob to the base Distribution class, which by default calls sample and then log_prob. When users call query, you will now have to check for each variable whether they request both, and if so call this method. This would keep all existing code backwards compatible. It would allow creating a new distribution which supports querying the log_prob only through that method. This would also not require an inverse model. That can be added later, when you have both a forward and an inverse model.
Another option is for the model to have, similar to self._tensor, a self._local_log_prob to facilitate this in a similar fashion. This, in fact, might be easier.
So with the second suggestion the normalizing flow example looks like this:
def q_net(x, z_dim, n_particles, n_planar_flows):
    with zs.BayesianNet() as variational:
        lz_x = tf.layers.dense(tf.to_float(x), 500, activation=tf.nn.relu)
        lz_x = tf.layers.dense(lz_x, 500, activation=tf.nn.relu)
        z_mean = tf.layers.dense(lz_x, z_dim)
        z_logstd = tf.layers.dense(lz_x, z_dim)

        def flow(samples, log_samples):
            return zs.planar_normalizing_flow(samples, log_samples,
                                              n_iters=n_planar_flows)

        z = zs.NormalFlow('z', flow,
                          z_mean, logstd=z_logstd, group_ndims=1,
                          n_samples=n_particles)
    return variational
All of the changes that were required can be viewed here: https://github.com/botev/zhusuan/commit/4873d6e990e93ca2bc14d6626528de41e17623aa
PS: I would also suggest that the NF return only the log_det, so that you don't pass in the base log-probability. There are cases where you just want to use the functional form of the flow, and if we have an inverse it can't take log_prob as input.
I think the key point here is that you suggest making Normalizing Flow a specific distribution, so that an error can be raised when its log_prob is called. This is good to have. But I feel it is better to have sample_and_log_prob implemented only in the flow distribution, because the current implementation constructs log_prob-related graphs in situations where users may only want tensor.
Doesn't TensorFlow skip those graphs during computation? If the user does not need them, they won't be evaluated. Also, note that this can easily be side-stepped by making it a closure internally.
Another option is as you mentioned to have this only for Flow Distribution and have a specific case in the query method.
Ok, so I think your suggestion is good. I've also implemented this here: https://github.com/botev/zhusuan/commit/94e43717ddafc36820657377848d9fb5da7a0357. I think this approach is good as it addresses both. I've also added code for an optional inverse model, which allows calculating log_prob if ever needed. Note that I don't think there is any way of not creating the log_det graph for the FlowDistribution (and there are not many use cases for that anyway).
However, we do need to set in stone the interface to the forward and inverse models, and I think it is better to return just the log_det from these rather than the sum log_x0 - log_det.
Well, actually introducing an inverse model is not necessary for normalizing flow (e.g., for the planar flow the inverse is not available in closed form). So I suggest we leave it for later, as part of the TransformedDistribution work, for which it is much harder to design a good API. For the normalizing flow distribution, another matter is shapes. Note that NF should only be applied to distributions of value_shape [], and it depends on the user how many dimensions of the batch_shape they consider as a group. So instead of applying the flow to the last dimension of batch_shape, we should take group_ndims into consideration.
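As a rough illustration of what group_ndims means for shapes, the sketch below sums per-element log-probabilities over the trailing group_ndims axes of a nested list, which is my reading of what the base log_prob reduction does. `reduce_group` is a hypothetical pure-Python helper written for this example, not a ZhuSuan function.

```python
def reduce_group(log_p, group_ndims):
    """Sum a nested list of log-probs over its last `group_ndims` axes."""

    def depth(x):
        # Number of nested list levels, i.e. the rank of the "tensor".
        return 1 + depth(x[0]) if isinstance(x, list) else 0

    def total(x):
        # Sum of every scalar below this node.
        return sum(total(e) for e in x) if isinstance(x, list) else x

    def walk(x, keep):
        # `keep` counts the leading axes that survive the reduction.
        if keep == 0:
            return total(x)
        return [walk(e, keep - 1) for e in x]

    return walk(log_p, depth(log_p) - group_ndims)
```

For a batch_shape of (2, 2, 2), group_ndims=1 leaves a (2, 2) result while group_ndims=2 leaves a vector of length 2. Under this reading, a flow applied to such a distribution would have to treat the same trailing axes as one event rather than assume only the last axis.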
Note that I don't think there is any way of not creating the log_det graph for the FlowDistribution (also not many use cases when that is the case as well).
I mean you do this for all distributions, because the tensor property is implemented by _tensor_and_log_prob.
Maybe we could leave sample_and_log_prob as a method in the base Distribution, implemented by default by directly passing the samples to log_prob, and overridden by the FlowDistribution.
So in the second implementation variant here: https://github.com/botev/zhusuan/commit/94e43717ddafc36820657377848d9fb5da7a0357 there are a few things:
1. I create _local_log_prob when .tensor is called only for the FlowDistribution. For any other distribution, it is created only if you explicitly request .local_log_prob; otherwise, the graphs are not constructed.
2. The inverse model is left optional (e.g. None by default) for the FlowDistribution. If you call log_prob and it is None, it raises an exception; otherwise it calculates the probability accordingly.
3. I've left sample_and_log_prob to exist only in FlowDistribution so that 1. is achievable.
As for the shapes, could you maybe give me an example? I'm not sure I understand the issue.
I spent some time thinking about this and have an improved version based on the second implementation.
For base distribution
def sample_and_log_prob(self):
    samples = self.sample()
    log_p = self.log_prob(samples)
    return samples, log_p
By default it will call sample and then log_prob
For FlowDistribution,
def sample_and_log_prob(self):
    # The base distribution produces both values; nothing is passed in.
    samples, log_p = self.base_dist.sample_and_log_prob()
    samples, log_p = planar_normalizing_flow(samples, log_p, self.n_flows)
    return samples, log_p

def _sample(self):
    # Maybe a specialized error is better
    raise NotImplementedError()

def _log_prob(self, given):
    raise NotImplementedError()
It was rewritten to use the forward function.
Then in the base StochasticTensor.
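@property
def tensor(self):
    if self._tensor is None:
        try:
            self._tensor = self._distribution.sample()
        except NotImplementedError:
            # Flow-like distributions only provide the joint method.
            self._tensor, self._local_log_prob = \
                self._distribution.sample_and_log_prob()
    return self._tensor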
How do you like this? This will remove code about a specific flow distribution in the base classes.
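A self-contained sketch of this fallback, in plain Python with toy distributions, might look as follows. `StochasticTensor` here is a simplified stand-in for the real class, and `Plain` and `JointOnly` are hypothetical examples; the caching attributes `_tensor` and `_local_log_prob` follow the names used earlier in the thread.

```python
class StochasticTensor:
    """Simplified stand-in for the lazy `tensor` property discussed above."""

    def __init__(self, distribution):
        self._distribution = distribution
        self._tensor = None
        self._local_log_prob = None

    @property
    def tensor(self):
        if self._tensor is None:
            try:
                self._tensor = self._distribution.sample()
            except NotImplementedError:
                # Flow-like distributions only expose the joint method.
                self._tensor, self._local_log_prob = (
                    self._distribution.sample_and_log_prob())
        return self._tensor

    @property
    def local_log_prob(self):
        value = self.tensor  # may already fill in the log-prob as a side effect
        if self._local_log_prob is None:
            self._local_log_prob = self._distribution.log_prob(value)
        return self._local_log_prob


class Plain:
    """Toy distribution with separate sample and log_prob."""

    def sample(self):
        return 2.0

    def log_prob(self, given):
        return -given


class JointOnly:
    """Toy flow-like distribution: only the joint method is available."""

    def sample(self):
        raise NotImplementedError()

    def log_prob(self, given):
        raise NotImplementedError()

    def sample_and_log_prob(self):
        return 1.5, -0.25
```

One detail worth noting in this sketch: `local_log_prob` touches `tensor` first, so that a joint-only distribution never has its `log_prob` called directly.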
Yes, that does sound good to me, and I was also thinking of adding exception handling rather than a check. One thing, however: I do really suggest that the flow has the interface:
samples, log_det_j = flow(samples, **kwargs)
And so then, following your suggestion, in the FlowDistribution:
def sample_and_log_prob(self):
    samples, log_p = self.base_dist.sample_and_log_prob()
    samples, log_det_j = planar_normalizing_flow(samples, self.n_flows)
    return samples, log_p - log_det_j
The reason is that if we add an inverse model where we observe z_t, there is no log_p to pass in. Other than that, if you are also happy with it, I can modify my implementation and make another PR.
Yep. It will be more consistent if all things in the transform module could have (samples, log_det) returned.
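A toy version of the (samples, log_det) convention, in plain Python, could look like this. The `Normal`, `affine_flow` and `FlowDistribution` definitions below are simplified, hypothetical stand-ins; an affine map replaces the planar flow only because its result can be checked analytically.

```python
import math
import random


class Normal:
    """Toy scalar normal with the default sample_and_log_prob."""

    def __init__(self, mean, std, rng=None):
        self.mean, self.std = mean, std
        self.rng = rng or random.Random()

    def sample(self):
        return self.rng.gauss(self.mean, self.std)

    def log_prob(self, x):
        z = (x - self.mean) / self.std
        return -0.5 * z * z - math.log(self.std) - 0.5 * math.log(2 * math.pi)

    def sample_and_log_prob(self):
        # Default behaviour: sample, then evaluate the density at the sample.
        x = self.sample()
        return x, self.log_prob(x)


def affine_flow(samples, scale=2.0, shift=0.5):
    """Toy flow returning (transformed samples, log|det J|)."""
    return scale * samples + shift, math.log(abs(scale))


class FlowDistribution:
    def __init__(self, base_dist, flow):
        self.base_dist, self.flow = base_dist, flow

    def sample_and_log_prob(self):
        samples, log_p = self.base_dist.sample_and_log_prob()
        # The flow never sees log_p; it only reports its own log-determinant.
        samples, log_det_j = self.flow(samples)
        return samples, log_p - log_det_j
```

Because the flow takes only the samples, the same function could later be reused by an inverse model, where no base log_p exists to pass in.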
I know this might be a bit strange, but I think it would be useful to at least have the option of an empirical distribution. This would, for instance, make specifying the "inference" networks more symmetric and show more clearly what the full graphical model is. It would also make it consistent to pass x as an observed variable when we query for the log-probabilities rather than when we build the model. This additionally gives the forward and backward models the exact same signatures. Taking the example: