riannevdberg / sylvester-flows

about log_p_zk #4

Closed Archer666 closed 4 years ago

Archer666 commented 4 years ago

Hi Rianne, this is great code, and I have a little question about log p(z_k). We hope that p(z_k) in a VAE can be a distribution whose form is not fixed, but the calculation of log p(z_k) on line 81 of loss.py seems to imply that p(z_k) is a standard Gaussian. Is there a mistake in my understanding?
Thank you for this code.

bgroenks96 commented 4 years ago

The law of the unconscious statistician permits expectations over the transformed density to be computed w.r.t. the initial density, provided that the transformation is differentiable and monotonic. Normalizing flows satisfy this constraint thanks to their invertibility.
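As a toy numerical check of this (nothing to do with the repo's code; the affine map and test function below are purely illustrative), sampling from the initial density and pushing the samples through the transform gives the same expectation as sampling from the transformed density directly:

```python
# Toy check of LOTUS: E_{z_K ~ q_K}[h(z_K)] == E_{z_0 ~ q_0}[h(f(z_0))]
# for an invertible transform f. All choices here are illustrative.
import torch

torch.manual_seed(0)
q0 = torch.distributions.Normal(0.0, 1.0)   # initial density q_0
f = lambda z: 2.0 * z + 1.0                 # a trivially invertible "flow"
h = lambda z: z ** 2                        # any test function

lhs = h(f(q0.sample((200_000,)))).mean()    # expectation taken over q_0

qK = torch.distributions.Normal(1.0, 2.0)   # pushforward of q_0 under f
rhs = h(qK.sample((200_000,))).mean()       # expectation taken over q_K

print(lhs.item(), rhs.item())               # both approx. 5.0 (= 1^2 + 2^2)
```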

riannevdberg commented 4 years ago

Log p(zk) in line 81 of loss.py computes the log probability of z_k under the prior distribution p. We assume that the prior in our model is a standard normal. In general you don't need to make this assumption. In this code we use a normalizing flow on the posterior q(z|x). As bgroenks96 already pointed out, you can use the law of the unconscious statistician to compute expectations w.r.t. the transformed density by sampling from the untransformed density.
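To make the bookkeeping concrete, here is a rough single-sample sketch of how the terms fit together (a simplified paraphrase rather than the exact code in loss.py; all names such as neg_elbo and sum_log_det_j are illustrative):

```python
import math
import torch

def log_standard_normal(z):
    # log N(z; 0, I), summed over the latent dimensions: this is the
    # "log p(z_k) under the prior" term discussed above.
    return (-0.5 * z.pow(2) - 0.5 * math.log(2 * math.pi)).sum(-1)

def neg_elbo(recon_log_lik, log_q0_z0, sum_log_det_j, zk):
    # z_0 ~ q_0(z_0|x) is sampled from the encoder's diagonal Gaussian,
    # zk is the result of pushing z_0 through the flow, and
    # sum_log_det_j accumulates log|det df_k/dz_{k-1}| over the flow steps.
    # By LOTUS, averaging this over samples of z_0 estimates -ELBO.
    log_p_zk = log_standard_normal(zk)
    return -(recon_log_lik + log_p_zk) + (log_q0_z0 - sum_log_det_j)
```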

RuihongQiu commented 4 years ago

Hi @riannevdberg and @bgroenks96, thank you for sharing the code and for the explanation.

I have similar questions here. According to my understanding, the objective is Eq. (7) in the paper.

The log p_theta(x, z) is actually equal to log p_theta(x|z_k) + log p_theta(z_k). The first term, log p_theta(x|z_k), accounts for the reconstruction loss, while the second term, log p_theta(z_k), is kind of tricky to me.

  1. The KL in a VAE is calculated from mu and var rather than from log_normal_diag(z) - log_normal_standard(z), according to Appendix B of the VAE paper. It seems inconsistent to use the loss.py here to calculate the VAE loss.

Regarding the flows and Figure 2:

  2. The z_k in the loss.py code is actually a sample from p_k(z_k|x) in Figure 2. The loss here is calculated as log_normal_standard(z_k). In this situation, z_k should not be close to a standard normal prior, because if the input to the decoder is close to the standard normal, it is just a VAE without flows. We don't want z_0, a normal, to transform into z_k, a standard normal. I think maybe z_0 should be close to the prior rather than z_k?

  3. A small request: I cannot figure out the detailed derivation of Eq. (7), because the notation for VI with NF is quite different from other papers (including the original VI with NF paper). I actually cannot see how this ELBO is related to the loss calculation.

Thank you again for such nice work and code.

bgroenks96 commented 4 years ago
  1. In a vanilla VAE, the KL divergence is computed between two normal distributions. There is a well-known, closed-form analytical solution for this, hence the use of mu and var directly. With normalizing flows, we are dealing with a transformed density, so we cannot use the closed-form Normal-Normal KL divergence. ~While it might be possible to derive an analytical KL divergence for a flow density (I don't know, I haven't tried)~, we don't need to (edit: actually, this isn't possible unless f has a constant/diagonal Jacobian, which would obviously be an overly restrictive constraint). We can just compute a Monte-Carlo estimate of the KL divergence by taking the expectation over the initial density. No inconsistencies here!

  2. See the previous comment about LOTUS. It's not immediately intuitive, but it is provably correct to evaluate log p(z_k) under the prior while taking the expectation over the initial density. Note that z_k does not follow a standard normal (as you suggest) but rather the transformed density learned by the flow.

  3. I think you are asking, "how do I get from the ELBO (equation 2) to the objective (equation 7)?". Rewrite the KL divergence as the expectation of the log difference, distribute the negative sign, add log p(x|z) to log p(z) to get log p(x, z), change the sign, and you will get equation 6. Apply equation 5 to arrive at equation 7. The algebra is sketched below.
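Sketching that algebra (my own write-up; the sign convention and the equation references are only meant to mirror the paper's, so treat them as approximate):

```latex
\begin{align*}
\mathrm{ELBO}(x)
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big) \\
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z) + \log p(z) - \log q(z|x)\big] \\
  &= \mathbb{E}_{q(z|x)}\big[\log p(x,z) - \log q(z|x)\big].
\end{align*}
% Substitute z = z_K = f_K \circ \dots \circ f_1(z_0), use the change of variables
%   \log q_K(z_K|x) = \log q_0(z_0|x) - \sum_{k=1}^{K} \log\big|\det \tfrac{\partial f_k}{\partial z_{k-1}}\big|,
% and take the expectation over q_0 (LOTUS):
\begin{align*}
\mathrm{ELBO}(x)
  = \mathbb{E}_{q_0(z_0|x)}\Big[\log p(x, z_K) - \log q_0(z_0|x)
      + \sum_{k=1}^{K} \log\big|\det \tfrac{\partial f_k}{\partial z_{k-1}}\big|\Big].
\end{align*}
```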

RuihongQiu commented 4 years ago

@bgroenks96 Thank you for your detailed reply!

Like @riannevdberg replied above, "Log p(zk) in line 81 of loss.py computes the log probability of z_k under the prior distribution p". Is this the prior of z_k or of z_0? Are we assuming a standard normal as the prior of z_0 or of z_k? I don't quite get why we need to know the probability of z_k under the prior. There isn't any term in Eq. (7) that consists of z_k and a prior.

I am a little bit confused about which two terms the KL should be calculated from. The ELBO on page 4 of Jakub Tomczak's slides from the ICML 2019 workshop (which I think is the same as Eq. (7) here) has the vanilla VAE loss form: a reconstruction loss and a KL divergence. I assume that the p_lambda(z_k) in the KL is the probability, after the flow transformation (parametrized with lambda), of the standard normal prior p(z_0)? Or have I got it wrong? It is kind of frustrating to understand a KL between q_0(z_0|x) and p_lambda(z_k).

riannevdberg commented 4 years ago

“There isn’t any term in Eq. (7) that consists of z_k and a prior.”

There is: it is hidden inside log p(x, z_k) = log p(x|z_k) + log p(z_k).

I assume that the p_lambda(z_k) in the KL is the probability, after the flow transformation (parametrized with lambda), of the standard normal prior p(z_0)? Or have I got it wrong? It is kind of frustrating to understand a KL between q_0(z_0|x) and p_lambda(z_k).

p_lambda(z_k) is really the density of z_k evaluated under p_lambda, which is a standard normal in this case. It is indeed a bit confusing to write KL(q_0(z_0|x) || p_lambda(z_k)), because you usually make sure that both distributions are over the same random variable. What is meant by this KL is really KL(q_0(z_0|x) || p_lambda(z_k)) = E_{q_0}[log q_0(z_0|x) - log p_lambda(z_k)].
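Written out explicitly (my notation, with the pushforward through the flow made explicit):

```latex
\mathrm{KL}\big(q_0(z_0|x)\,\|\,p_\lambda(z_K)\big)
  := \mathbb{E}_{q_0(z_0|x)}\big[\log q_0(z_0|x) - \log p_\lambda(z_K)\big],
\quad z_K = f_K \circ \dots \circ f_1(z_0), \quad p_\lambda = \mathcal{N}(0, I).
```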

RuihongQiu commented 4 years ago

Thank you for the reply! I understand now why it is the log standard-normal probability :) Many thanks!

tumis1946 commented 3 years ago

Hello Rianne, thanks very much. I am a bit confused by line 44 in loss.py: loss = bce + beta * kl. Based on equation 3 in Tomczak's paper (Improving Variational Auto-Encoders using Householder Flow), shouldn't it be "loss = bce - beta * kl"? Also, why use -ELBO instead of ELBO when reporting your metrics? Thanks

riannevdberg commented 3 years ago

Hi,

I am a bit confused by line 44 in loss.py: loss = bce + beta * kl. Based on equation 3 in Tomczak's paper (Improving Variational Auto-Encoders using Householder Flow), shouldn't it be "loss = bce - beta * kl"?

We want to maximize the log-likelihood (which is intractable), and the ELBO is a lower bound on the log-likelihood, so maximizing the ELBO is what we want to do. This is equivalent to minimizing -ELBO, which is the negative of the expression in eq. 3 of Tomczak's paper. So loss = -ELBO = - E_q(z|x)[log p(x|z)] + KL(q(z|x) || p(z)). Note that the binary cross entropy (bce) has a minus sign in its definition: bce = - E_q(z|x)[log p(x|z)].
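A minimal sketch of that sign bookkeeping (illustrative names only, not the exact code on line 44 of loss.py):

```python
import torch
import torch.nn.functional as F

def flow_vae_loss(x_recon, x, kl, beta=1.0):
    # bce = -E_q[log p(x|z)] for a Bernoulli decoder, so the reconstruction
    # term already carries the minus sign of -ELBO.
    bce = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # -ELBO = -E_q[log p(x|z)] + KL(q(z|x) || p(z)) = bce + kl  (lower is better)
    return bce + beta * kl
```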

Also, why use -ELBO instead of ELBO when reporting your metrics?

This is just a choice that doesn't really matter; several papers in the literature report -ELBO. Either you report the ELBO and higher is better, or you report -ELBO and lower is better.

I hope this helps.

Yupei-Du commented 2 years ago

Hi Rianne,

Thank you very much for the amazing paper, code, and all the reply comments above!!

I noticed your discussion above with @RuihongQiu; in fact, I have a similar confusion. As @bgroenks96 mentioned, from LOTUS I understand that we can sample from q(z_0) instead of q(z_k) to estimate the expectation of p(z_k) over q(z_k) (i.e. E_{z_k ~ q} p_theta(z_k)). However, I still struggle to understand why we should calculate the probability of z_k under the prior (i.e. line 81 in loss.py). As far as I can see, p(z_k) should be considered as a function of z_k, rather than a transformed density, if we want to take the expectation under the q distributions (i.e. q(z_k) or q(z_0)). In other words, we can sample from q(z_0) to estimate p(z_k), but we still need to know the value of p(z_k) (which, to my understanding, is the probability density of z_k under the prior transformed by the flow, rather than under the standard Gaussian prior itself) for each sample in order to estimate it?

I am not sure if I've made a mistake here or if I've made my question clear. Thank you very much in advance for your time and patience.

Yupei