rdevon / DIM

Deep InfoMax (DIM), or "Learning Deep Representations by Mutual Information Estimation and Maximization"

MINE as an estimation of Mutual Information between input space and latent representation #40

Open sachahai opened 4 years ago

sachahai commented 4 years ago

Thank you very much for your interesting work and for releasing the code, much appreciated!

I am implementing several manifold learning methods (from 64x64 images down to 3D) that include a joint optimization of mutual information (MI) with MINE-style tricks (DIM, InfoMax VAE, ...). Since in those methods we are interested in maximizing MI (not in obtaining its precise value), I understand the use of a more stable (but less tight) lower bound on MI, such as the Jensen-Shannon divergence or InfoNCE.
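For concreteness, here is a minimal sketch of the kind of JSD-based objective I mean (the critic architecture, layer sizes, and names are my own illustrative assumptions, not taken from this repo):

```python
# Minimal sketch of a Jensen-Shannon MI lower bound (f-GAN form), the kind of
# objective maximized in DIM-style training. The Critic is a hypothetical
# small MLP on concatenated (x, z) pairs; sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    def __init__(self, x_dim, z_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1))

def jsd_mi_lower_bound(critic, x, z):
    # Positives: (x_i, z_i) from the joint; negatives: (x_i, z_j) with z
    # shuffled across the batch (approximating the product of marginals).
    z_neg = z[torch.randperm(z.size(0))]
    pos = -F.softplus(-critic(x, z)).mean()
    neg = F.softplus(critic(x, z_neg)).mean()
    return pos - neg  # maximize this quantity
```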

However, as you also did in your DIM paper (?), I now want to use the mutual information between the input space and the latent representation as a quantitative metric to evaluate the latent code, and to be able to compare it with state-of-the-art techniques (e.g., UMAP, t-SNE).

As we want a precise estimate of MI, do you agree that:

I tried several implementations of this and got an overall coherent MI behavior, but it is very unstable (no clear asymptote at all); it would be difficult to extract a single MI estimate from the output. I have therefore failed to use MINE as a metric to compare different dimensionality reduction techniques. It would be very helpful if you could share your implementation of MINE for that purpose, or just some insight into the architecture you used, the optimizer, and the lower bound on MI you used.
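For reference, this is roughly what I mean by a MINE estimator (a minimal sketch of the Donsker-Varadhan bound with the moving-average gradient correction from the MINE paper; the critic and all hyperparameters are my own assumptions, and it reuses the `Critic` sketched above):

```python
# Minimal MINE sketch: Donsker-Varadhan bound E_P[T] - log E_Q[e^T], with an
# exponential moving average on the log-term denominator (the bias-reduction
# trick from the MINE paper). `ema` is a running scalar, e.g. initialized to 1.0.
import torch

def mine_step(critic, x, z, ema, alpha=0.99):
    z_neg = z[torch.randperm(z.size(0))]            # product-of-marginals samples
    t_pos = critic(x, z).mean()                     # E_P[T]
    exp_t_neg = torch.exp(critic(x, z_neg)).mean()  # batch estimate of E_Q[e^T]
    ema = alpha * ema + (1 - alpha) * exp_t_neg.detach()
    mi_estimate = t_pos - torch.log(exp_t_neg)      # value to report / average
    # Surrogate loss: its gradient matches the DV gradient with the EMA in the
    # denominator, which is the bias-corrected update described in MINE.
    loss = -(t_pos - exp_t_neg / ema)
    return mi_estimate, loss, ema
```

The idea would be to train this with an optimizer such as Adam and to average `mi_estimate` over many batches rather than reading off a single-batch value, but even so I do not get a clear plateau.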

Any advice is welcome. Thank you so much in advance!

rdevon commented 4 years ago

I would check out Ben Poole's paper, which has a bit more analysis on the estimation side: http://proceedings.mlr.press/v97/poole19a.html
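For what it's worth, one of the bounds analyzed there, InfoNCE, is low-variance but capped at log(batch size), which is one reason it behaves differently from the DV bound as an estimator. A rough sketch (not the DIM code, and the critic is the same hypothetical one as above):

```python
# Rough sketch of the InfoNCE lower bound on MI: low variance, but bounded
# above by log(batch size), so it saturates when the true MI is large.
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(critic, x, z):
    K = x.size(0)
    # scores[i, j] = T(x_i, z_j); the diagonal holds the joint (positive) pairs
    scores = critic(x.repeat_interleave(K, dim=0), z.repeat(K, 1)).view(K, K)
    return math.log(K) + torch.diagonal(F.log_softmax(scores, dim=1)).mean()
```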