paulbrodersen / entropy_estimators

Estimators for the entropy and other information theoretic quantities of continuous distributions
GNU General Public License v3.0

Add normalized version of MVN entropy estimator and fix floating point issue #20

Open mahlzahn opened 11 months ago

mahlzahn commented 11 months ago

1. Add argument normalized

Add an argument normalized to the get_h_mvn function which, when set, returns the entropy of the normalized MVN distribution: each component is scaled to unit variance, so the covariance matrix becomes the matrix of Pearson correlation coefficients. The entropy thus becomes invariant under some linear transformations (componentwise scaling and translation), though not under rotation (see the table below).
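Concretely, the normalization amounts to evaluating the closed-form MVN entropy with the covariance matrix replaced by the correlation matrix. A minimal sketch of that computation (the function name is illustrative, not the exact code in this PR):

```python
import numpy as np

def get_h_mvn_normalized_sketch(x):
    """Sketch: closed-form MVN entropy of x after scaling each component
    to unit variance, i.e. with the covariance matrix replaced by the
    matrix of Pearson correlation coefficients."""
    d = x.shape[1]
    # np.atleast_2d handles the 1D case, where np.corrcoef returns a scalar
    r = np.atleast_2d(np.corrcoef(x, rowvar=False))
    # H = 0.5 * log((2*pi*e)**d * det(R)), split into log terms to avoid overflow
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.log(np.linalg.det(r)))
```

The following script demonstrates the behaviour: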

```python
import numpy as np
from entropy_estimators.continuous import get_h_mvn

rng = np.random.default_rng(seed=0)
a = rng.normal(scale=1, size=10000).reshape(-1, 1)
b = rng.normal(scale=2, size=10000).reshape(-1, 1)
c = 5 * a + 7
d = np.c_[a, b]
r = np.pi / 4
rot = [[np.cos(r), -np.sin(r)], [np.sin(r), np.cos(r)]]
e = (rot @ d.T).T
f = d * [5, 2] + [-3, 8]
g = np.c_[a, a + b/1e5]
h = np.c_[a, a + b/1e9]
i = np.c_[a, a]
dists = [a, b, c, d, e, f, g, h, i]
print('|  |a |b |5a+7|d=[a b]|rot(d)|[5 2]⋅d+[-3 8]|[a a+b/1e5]|[a a+b/1e9]|[a a]|')
print('|--|--|--|----|-------|------|--------------|-----------|-----------|-----|')
print('|μ', *[' '.join([f'{s:.2f}' for s in np.ravel(x.mean(axis=0))]) for x in dists], sep='|', end='|\n')
print('|σ', *[' '.join([f'{s:.2f}' for s in np.ravel(x.std(ddof=1, axis=0))]) for x in dists], sep='|', end='|\n')
print('|H', *[f'{get_h_mvn(x):.2f}' for x in dists], sep='|', end='|\n')
print('|H’', *[f'{get_h_mvn(x, normalized=True):.2f}' for x in dists], sep='|', end='|\n')
```

It calculates the entropy H and the normalized entropy H’ for two distributions a and b, a third c = 5a + 7, and so on:

|    | a    | b    | 5a+7 | d=[a b]   | rot(d)    | [5 2]⋅d+[-3 8] | [a a+b/1e5] | [a a+b/1e9] | [a a]     |
|----|------|------|------|-----------|-----------|----------------|-------------|-------------|-----------|
| μ  | 0.01 | 0.01 | 7.03 | 0.01 0.01 | 0.00 0.01 | -2.97 8.01     | 0.01 0.01   | 0.01 0.01   | 0.01 0.01 |
| σ  | 1.00 | 1.99 | 4.99 | 1.00 1.99 | 1.58 1.56 | 4.99 3.98      | 1.00 1.00   | 1.00 1.00   | 1.00 1.00 |
| H  | 1.42 | 2.11 | 3.03 | 3.52      | 3.52      | 5.83           | -7.99       | nan         | -inf      |
| H’ | 1.42 | 1.42 | 1.42 | 2.84      | 2.62      | 2.84           | -7.99       | nan         | -inf      |

Thus, the normalized entropy of a MVN random variable X with dimension d and uncorrelated components is equal to

H(X) = d ⋅ log(2 ⋅ π ⋅ e) / 2 ≈ 1.42 ⋅ d.

This is also the maximum normalized entropy for a d-dimensional variable; it is lower if the components are correlated, e.g., in the case of the rotated 2D MVN random variable (see table above).
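This follows from the closed-form entropy of a d-dimensional MVN, H(X) = log((2 ⋅ π ⋅ e)^d ⋅ det Σ) / 2. After normalization, Σ becomes the correlation matrix R, which has unit diagonal, so det R ≤ 1 by Hadamard's inequality and hence

H’(X) = d ⋅ log(2 ⋅ π ⋅ e) / 2 + log(det R) / 2 ≤ d ⋅ log(2 ⋅ π ⋅ e) / 2,

with equality exactly when R is the identity, i.e., when the components are uncorrelated.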

2. Fix floating point issue

The current implementation fails to calculate the entropy of highly correlated variables properly because of limited floating point resolution. I fixed this by returning -inf if the determinant of the matrix of Pearson correlation coefficients equals 0, and nan if the determinant is merely close to 0 (|det(…)| < 10⁻¹³). The last three columns of the table above demonstrate the new behaviour: the entropy of [a a+b/1e5] is -7.99, that of [a a+b/1e9] is nan, and that of [a a] is -inf, indicating that the second one cannot be calculated reliably.
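A minimal sketch of the fix (the helper name is illustrative; the PR's actual code may differ):

```python
import numpy as np

def _log_det_with_guard(r, tol=1e-13):
    """Sketch of the floating point guard: distinguish an exactly singular
    correlation matrix (-> -inf) from one whose determinant is merely too
    close to 0 to be trusted (-> nan)."""
    det = np.linalg.det(r)
    if det == 0:
        return -np.inf  # exactly singular, e.g. [a a]
    if abs(det) < tol:
        return np.nan   # below float resolution, e.g. [a a+b/1e9]
    return np.log(det)
```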

3. Speed-up of MVN entropy estimate for 1D variables

… by using the variance instead of the full covariance matrix calculation; see the sketch below.
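A minimal sketch of the 1D fast path (illustrative; the closed-form entropy of a 1D Gaussian is 0.5 ⋅ log(2 ⋅ π ⋅ e ⋅ σ²)):

```python
import numpy as np

def get_h_gaussian_1d_sketch(x):
    # for a single variable the covariance matrix reduces to the variance,
    # so the full np.cov machinery can be skipped
    var = np.var(x, ddof=1)
    return 0.5 * np.log(2 * np.pi * np.e * var)
```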

paulbrodersen commented 11 months ago

> Add argument normalized

Could you expand a bit on the motivation, or provide some references and/or applications?

> Fix floating point issue. The current implementation fails to calculate the entropy of highly correlated variables properly because of limited floating point resolution.

Much appreciated.

> Speed-up of MVN entropy estimate for 1D variables by using the variance instead of the full covariance matrix calculation.

Did you time it? Since both implementations ultimately rely on LAPACK/OpenBLAS, I would be shocked if the difference was substantial (> 1.5x).

mahlzahn commented 11 months ago

> > Speed-up of MVN entropy estimate for 1D variables by using the variance instead of the full covariance matrix calculation.
>
> Did you time it? Since both implementations ultimately rely on LAPACK/OpenBLAS, I would be shocked if the difference was substantial (> 1.5x).

~2 times in my tests. As I am computing entropies for thousands of variables or pairs, I’d say it matters (a bit) ;)
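For reference, a micro-benchmark along these lines (illustrative, not taken from this PR) might look like:

```python
import timeit
import numpy as np

x = np.random.default_rng(0).normal(size=(10000, 1))

# one variable: full covariance matrix route vs. plain variance
t_cov = timeit.timeit(lambda: np.cov(x, rowvar=False), number=1000)
t_var = timeit.timeit(lambda: np.var(x, ddof=1), number=1000)
print(f'np.cov: {t_cov:.4f} s, np.var: {t_var:.4f} s, ratio: {t_cov / t_var:.1f}x')
```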

paulbrodersen commented 11 months ago

> > Speed-up of MVN entropy estimate for 1D variables by using the variance instead of the full covariance matrix calculation.
> >
> > Did you time it? Since both implementations ultimately rely on LAPACK/OpenBLAS, I would be shocked if the difference was substantial (> 1.5x).
>
> ~2 times in my tests. As I am computing entropies for thousands of variables or pairs, I’d say it matters (a bit) ;)

Alright, I hate the increase in code complexity but we don't leave factors of two on the table.

paulbrodersen commented 10 months ago

When you have time, could you expand a bit on the motivation for the normalization, or provide some references and/or applications? I don't want to support something even I don't understand. ;-)