mfederici / Multi-View-Information-Bottleneck

Implementation of Multi-View Information Bottleneck

Calculation process of the KL divergence #6

Closed xiami2019 closed 2 years ago

xiami2019 commented 2 years ago

Hello, nice work! I have a few questions about how to calculate the KL divergence.

```python
kl_1_2 = p_z1_given_v1.log_prob(z1) - p_z2_given_v2.log_prob(z1)
kl_2_1 = p_z2_given_v2.log_prob(z2) - p_z1_given_v1.log_prob(z2)
skl = (kl_1_2 + kl_2_1).mean() / 2.
```

These are the calculation steps of the SKL divergence in mib.py. I wonder whether it is correct to use the log-probability of a single sample to compute the KL divergence (kl_1_2 and kl_2_1). I think the KL should be computed as an expectation over the distribution, so there might be a small mistake here, since the code above ignores the expectation over p_z_given_v. Looking forward to your reply.

mfederici commented 2 years ago

Hi, first of all, thank you for your comment! You are correct: the KL divergence is estimated using only one sample. One could of course use more samples to estimate this value, but in practice a single sample already works well enough. The sampling and estimation procedure is analogous to the one used to estimate the KL divergence in variational autoencoders and can be thought of as a Monte Carlo estimate with a sample size of 1. We haven't extensively explored this direction, since our work is not the first to estimate KL divergences with a single sample.
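For illustration only, here is a minimal sketch (not code from this repository) of how the estimate could be extended from one sample to `n_samples` Monte Carlo samples, assuming `p_z1_given_v1` and `p_z2_given_v2` are `torch.distributions.Normal` objects produced by the encoders as in mib.py; the function name and `n_samples` argument are hypothetical:

```python
import torch

def skl_monte_carlo(p_z1_given_v1, p_z2_given_v2, n_samples=8):
    # Draw reparameterized samples from each posterior: [n_samples, batch, z_dim]
    z1 = p_z1_given_v1.rsample(torch.Size([n_samples]))
    z2 = p_z2_given_v2.rsample(torch.Size([n_samples]))

    # Single-sample KL estimates, averaged over the Monte Carlo dimension
    kl_1_2 = (p_z1_given_v1.log_prob(z1) - p_z2_given_v2.log_prob(z1)).mean(0)
    kl_2_1 = (p_z2_given_v2.log_prob(z2) - p_z1_given_v1.log_prob(z2)).mean(0)

    # Symmetrized KL, averaged over the batch (and latent dimensions)
    return (kl_1_2 + kl_2_1).mean() / 2.
```

With `n_samples=1` this reduces to the estimate used in mib.py.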

Please feel free to follow up if you have further questions.

xiami2019 commented 2 years ago

Hi, thanks for your reply, it helps a lot. I also noticed that p_z1_given_v1 is a normal distribution whose mu and sigma are known. Have you ever tried computing the KL divergence in closed form directly from the mu and sigma of p_z1_given_v1 and p_z2_given_v2? If so, how does its performance compare to the current sample-based estimate?

mfederici commented 2 years ago

Hi, while it is possible to compute the KL divergence between Normals in closed form, we haven't observed any significant advantage in doing so, as far as I recall. Once again, this is similar to the KL between Normals you would find in a VAE objective, and you can find implementations that use either strategy. From my personal experience, the sample-based KL is a slightly weaker regularizer than the closed-form one. The sample-based version also has the advantage that it can be used with more complex encoding distributions (e.g. flow-transformed distributions) without the need to change any code.
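For reference, a minimal sketch (not from the repository) of the closed-form alternative being discussed, assuming both posteriors are diagonal Normals; the example distributions and shapes are placeholders:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Placeholder posteriors standing in for the encoder outputs
p_z1_given_v1 = Normal(loc=torch.zeros(4, 8), scale=torch.ones(4, 8))
p_z2_given_v2 = Normal(loc=torch.ones(4, 8) * 0.5, scale=torch.ones(4, 8) * 2.0)

# Analytic symmetrized KL: no sampling, but it requires a closed-form KL,
# which rules out more complex (e.g. flow-transformed) encoding distributions.
skl_closed_form = (kl_divergence(p_z1_given_v1, p_z2_given_v2)
                   + kl_divergence(p_z2_given_v2, p_z1_given_v1)).mean() / 2.
```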

xiami2019 commented 2 years ago

Got it, thank you!