Closed: Archer666 closed this issue 4 years ago.
The law of the unconscious statistician permits expectations over the transformed density to be computed w.r.t. the initial density, provided that the transformation is differentiable and invertible. Normalizing flows satisfy this constraint by construction.
`log p(z_k)` in line 81 of `loss.py` computes the log probability of `z_k` under the prior distribution p. We assume that the prior in our model is a standard normal; in general you don't need to make this assumption. In this code we use a normalizing flow on the posterior q(z|x). As @bgroenks96 already pointed out, you can use the law of the unconscious statistician to compute expectations w.r.t. the transformed density by sampling from the untransformed density.
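As a concrete sketch of that last point (a toy example, not the repo's code: the affine map below is a hypothetical stand-in for the learned flow, and `log_standard_normal` plays the role of the log-prior in line 81 of `loss.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Base density q_0: standard normal. Hypothetical "flow" T: a simple
# affine map z_k = a * z_0 + b standing in for the learned transformation.
a, b = 2.0, 1.0
z0 = rng.standard_normal(100_000)
zk = a * z0 + b

def log_standard_normal(z):
    # log density of N(0, 1) at z
    return -0.5 * (np.log(2 * np.pi) + z ** 2)

# LOTUS: E_{z_k ~ q_k}[f(z_k)] = E_{z_0 ~ q_0}[f(T(z_0))], so we average
# the log-prior over transformed base samples -- no sampling from q_k needed.
mc_estimate = log_standard_normal(zk).mean()

# Closed form for comparison: with z_k ~ N(b, a^2),
# E[log p(z_k)] = -0.5*log(2*pi) - 0.5*(a^2 + b^2).
exact = -0.5 * np.log(2 * np.pi) - 0.5 * (a ** 2 + b ** 2)
print(mc_estimate, exact)  # the two values agree to Monte-Carlo error
```

The only ingredients are samples from the untransformed density and the forward pass of the flow; the density of `z_k` itself is never needed for this expectation.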
Hi @riannevdberg and @bgroenks96, thank you for sharing the code and for the explanation.
I have a similar question. According to my understanding, the objective is Eq. (7) in the paper: log p_theta(x, z) is actually equal to log p_theta(x|z_k) + log p_theta(z_k). The first term, log p_theta(x|z_k), accounts for the reconstruction loss, while the second term, log p_theta(z_k), is kind of tricky to me. In a vanilla VAE, the KL divergence is calculated directly from `mu` and `var` rather than as `log_normal_diag(z) - log_normal_standard(z)`, according to Appendix B in the VAE paper. It seems inconsistent to calculate the VAE loss with the `loss.py` here.
According to the flows and Figure 2: the `z_k` in the `loss.py` code is actually p_k(z_k|x) in Figure 2, and the loss here is calculated as `log_normal_standard(z_k)`. Under this situation, z_k should not be close to a standard normal prior, because if the input to the decoder is close to the standard normal, it's just a VAE without flows. We don't want z_0, a normal, to transform into z_k, a standard normal. I think maybe z_0 should be close to the prior rather than z_k?
A small request: I cannot figure out the detailed derivation of Eq. (7), because the notation for VI with NF here is quite different from other papers (including the original VI with NF paper). I actually cannot see how this ELBO is related to the loss calculation.
Thank you again for such nice work and code.
In a vanilla VAE, the KL divergence is computed between two normal distributions. There is a well-known, closed-form analytical solution for this, hence the use of `mu` and `var` directly. With normalizing flows, we are dealing with a transformed density, thus we cannot use the closed-form Normal-Normal KL divergence. ~While it might be possible to derive an analytical KL divergence for a flow density (I don't know, I haven't tried)~, we don't need to (edit: actually, this isn't possible unless f has a constant/diagonal Jacobian, which would obviously be an overly restrictive constraint). We can just compute a Monte-Carlo estimate of the KL divergence by taking the expectation over the initial density. No inconsistencies here!
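To make the Monte-Carlo KL estimate concrete, here is a toy sketch under stated assumptions: a scalar affine map stands in for the flow (so the log-det-Jacobian is simply log a), and the closed-form Normal-Normal KL is used only as a reference value that a real flow density would not have:

```python
import numpy as np

rng = np.random.default_rng(0)

# Base q_0 = N(0, 1); hypothetical affine "flow" z_k = a*z_0 + b,
# so q_k = N(b, a^2) and log|det J| = log a.
a, b = 2.0, 1.0
z0 = rng.standard_normal(200_000)
zk = a * z0 + b

def log_normal(z, mu=0.0, var=1.0):
    # log density of N(mu, var) at z
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

# Monte-Carlo KL(q_k || p), expectation taken over the initial density:
# log q_k(z_k) = log q_0(z_0) - log|det J|   (change of variables)
mc_kl = (log_normal(z0) - np.log(a) - log_normal(zk)).mean()

# Closed-form KL(N(b, a^2) || N(0, 1)), available here only because the
# toy "flow" keeps the density Gaussian:
exact_kl = 0.5 * (a ** 2 + b ** 2 - 1.0 - np.log(a ** 2))
print(mc_kl, exact_kl)  # agree to Monte-Carlo error
```

Only samples from the base density and the flow's log-det-Jacobian are needed, which is exactly what the training loss uses.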
See the previous comment about LOTUS. It's not immediately intuitive, but it is provably correct to take the expectation over the initial density. Note that z_k does not follow a standard normal, as you say, but rather the transformed density learned by the flow.
I think you are asking, "how do I get from the ELBO (equation 2) to the objective (equation 7)?". Rewrite the KL divergence as the expectation of the log difference, distribute the negative sign, add log p(x|z) to log p(z) to get log p(x,z), change the sign, and you will get equation 6. Apply equation 5 to arrive at equation 7.
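Those steps can be written out; this is a sketch of my reading, with the equation numbers matching the comment above:

```latex
\begin{align*}
\mathrm{ELBO}
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
     - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big) \\
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
     - \mathbb{E}_{q(z|x)}\big[\log q(z|x) - \log p(z)\big]
     && \text{(KL as an expectation)} \\
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z) + \log p(z)\big]
     - \mathbb{E}_{q(z|x)}\big[\log q(z|x)\big] \\
  &= \mathbb{E}_{q(z|x)}\big[\log p(x,z)\big]
     - \mathbb{E}_{q(z|x)}\big[\log q(z|x)\big]
     && \text{(negate to get eq.\ 6)}
\end{align*}
```

Applying the change of variables of eq. 5, $\log q_K(z_K|x) = \log q_0(z_0|x) - \sum_{k=1}^{K} \log\big|\det \tfrac{\partial z_k}{\partial z_{k-1}}\big|$, and taking all expectations over $q_0$ via LOTUS, yields eq. 7:

```latex
-\mathrm{ELBO}
  = \mathbb{E}_{q_0(z_0|x)}\Big[\log q_0(z_0|x)
    - \sum_{k=1}^{K} \log\Big|\det \frac{\partial z_k}{\partial z_{k-1}}\Big|
    - \log p(x, z_K)\Big]
```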
@bgroenks96 Thank you for your detailed reply!
Like @riannevdberg replied above, "log p(z_k) in line 81 of loss.py computes the log probability of z_k under the prior distribution p". Is this the prior of z_k or of z_0? Are we assuming a standard normal as the prior of z_0 or of z_k? I don't quite get why we need to know the probability of z_k under the prior; there isn't any term in Eq. (7) consisting of z_k and a prior.
I am a little bit confused about which two terms the KL divergence should be calculated between. The following ELBO is from page 4 of Jakub Tomczak's slides at the ICML 2019 workshop (which I think is the same as Eq. (7) here). It has the vanilla VAE loss form, a reconstruction loss and a KL divergence:
I assume that the p_lambda(z_k) in the KL is the density after the flow transformation (parametrized with lambda) of the standard normal prior p(z_0) here? Or am I wrong? It is kind of frustrating to understand a KL between q_0(z_0|x) and p_lambda(z_k).
“There isn’t any term in Eq. (7) consisting of z_k and a prior.”
There is: it is hidden inside log p(x, z_k) = log p(x|z_k) + log p(z_k).
“I assume that the p_lambda(z_k) in the KL is the density after the flow transformation (parametrized with lambda) of the standard normal prior p(z_0) here? Or am I wrong? It is kind of frustrating to understand a KL between q_0(z_0|x) and p_lambda(z_k).”
p_lambda(z_k) is really the density of z_k evaluated under p_lambda, which is a standard normal in this case. It is indeed a bit confusing to write KL(q_0(z_0|x) || p_lambda(z_k)), because you usually make sure that both distributions are over the same random variable. What is meant by this KL is really KL(q_0(z_0|x) || p_lambda(z_k)) = E_{q_0}[log q_0(z_0|x) - log p_lambda(z_k)].
Thank you for the reply! I can understand why it is log standard normal probability now :) Many thanks!
Hello Rianne, thanks very much. I am a bit confused by line 44 in loss.py: `loss = bce + beta * kl`. Based on equation 3 in Tomczak's paper (Improving Variational Auto-Encoder Using Householder Flows), shouldn't it be `loss = bce - beta * kl`? Also, why use -ELBO instead of ELBO when reporting your metrics? Thanks
Hi,
I am a bit confused by line 44 in loss.py: `loss = bce + beta * kl`. Based on equation 3 in Tomczak's paper (Improving Variational Auto-Encoder Using Householder Flows), shouldn't it be `loss = bce - beta * kl`?
We want to maximize the log-likelihood, which is intractable; the ELBO is a lower bound on the log-likelihood, so maximizing the ELBO is what we want to do. This is equivalent to minimizing -ELBO, and -ELBO is the negative of the expression in eq. 3 of Tomczak's paper. So loss = -ELBO = -E_{q(z|x)}[log p(x|z)] + KL(q(z|x) || p(z)). Note that the binary cross entropy (bce) has a minus sign in its definition: bce = -E_{q(z|x)}[log p(x|z)].
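The sign convention can be illustrated with a small numeric sketch (hypothetical numbers, not the repo's `loss.py`; the closed-form KL below is the vanilla-VAE, no-flow case):

```python
import numpy as np

# Binary targets and hypothetical decoder Bernoulli means.
x = np.array([1.0, 0.0, 1.0, 1.0])
x_hat = np.array([0.9, 0.2, 0.8, 0.7])

# bce = -E_q[log p(x|z)]: the minus sign is built into its definition,
# so bce is a positive quantity here.
bce = -(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat)).sum()

# Closed-form KL(N(mu, var) || N(0, 1)) per latent dimension, summed
# (the vanilla-VAE case with hypothetical encoder outputs):
mu = np.array([0.5, -0.3])
log_var = np.array([-0.2, 0.1])
kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var).sum()

beta = 1.0
loss = bce + beta * kl  # = -ELBO; minimizing this maximizes the ELBO
print(bce, kl, loss)
```

Because the minus sign already lives inside `bce`, the KL term is *added*, not subtracted; subtracting it would reward a posterior far from the prior.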
Also, why use -ELBO instead of ELBO when reporting your metrics?
This is just a choice that doesn't really matter, several papers in the literature report -ELBO. Either you report ELBO and higher is better, or you report -ELBO and lower is better.
I hope this helps.
Hi Rianne,
Thank you very much for the amazing paper, code, and all the reply comments above!!
I noticed your discussion above with @RuihongQiu. In fact, I have a similar confusion. Like what @bgroenks96 has mentioned, from LOTUS I understand that we can sample from q(z_0) instead of q(z_k) to estimate the expectation of p(z_k) over q(z_k) (i.e. E_{z_k ~ q}[p_theta(z_k)]). However, I still struggle to understand why we should calculate the probability of z_k under the prior (i.e. line 81 in loss.py). As far as I can see, p(z_k) should be considered a function of z_k rather than a transformed density if we want to take the expectation under the q distributions (i.e. q(z_k) or q(z_0)). In other words, we can sample from q(z_0) to estimate the expectation, but we still need to know the value of p(z_k) (which, to my understanding, is the density of z_k under the prior transformed by the flow, rather than under the standard Gaussian prior itself) at each sample to estimate it?
I am not sure if I've made a mistake here or if I've made my question clear. Thank you very much in advance for your time and patience.
Yupei
Hi Rianne, this is great code, and I have a little question about log p(z_k): we hope that p(z_k) in the VAE can be a distribution whose form is not fixed, but the calculation of log p(z_k) in line 81 of loss.py seems to imply that p(z_k) is a standard Gaussian. Is there some mistake in my understanding?
Thank you for this code!