paulbrodersen / entropy_estimators

Estimators for the entropy and other information theoretic quantities of continuous distributions
GNU General Public License v3.0

Unexpected -inf entropy estimations #3

Closed: xanderdunn closed this issue 3 years ago

xanderdunn commented 3 years ago

Thanks for providing this open-source implementation of these entropy estimators.

Here are two datasets in .csv format that are unexpectedly giving me -inf entropy estimations: Datasets.zip

Plotting the datasets shows that they are pretty reasonable. This is test_dataset_1 (a sine wave with some added Gaussian noise): [plot: 6_TrigData sin]

This is test_dataset_2: [plot: test_dataset_2]

When I estimate their entropies:

#!/usr/bin/env python3

import pandas as pd
from entropy_estimators import continuous

# Load the first column of each dataset as a 1-D numpy array.
data1 = pd.read_csv("./data/test_set_1.csv").iloc[:, 0].values
print(data1)
print("{} points {} sum in data1".format(len(data1), sum(data1)))

data2 = pd.read_csv("./data/test_set_2.csv").iloc[:, 0].values
print(data2)
print("{} points {} sum in data2".format(len(data2), sum(data2)))

# Estimate the differential entropy with the k=5 nearest-neighbour estimator.
print(continuous.get_h(data1, k=5))
print(continuous.get_h(data2, k=5))

I get this output:

[ 0.13657672 14.01610649 29.53076472 ... 89.05243545 63.88229066
 46.86725742]
5000 points 229158.5019361417 sum in data1
[ 0.00315008  0.00165544  0.00293935 ...  0.00210567  0.00113199
 -0.00142328]
2382 points 0.14870037135740477 sum in data2
python3.7/site-packages/entropy_estimators/continuous.py:224: RuntimeWarning: divide by zero encountered in log
  sum_log_dist = np.sum(log(2*distances)) # where did the 2 come from? radius -> diameter
-inf
-inf

I haven't found any commonality between the two that explains it. test_set_2 has values that are exactly 0.0, but test_set_1 doesn't, although test_set_1 does have values as small as 1e-5. Surely we would expect the Kozachenko-Leonenko (K-L) entropy estimator to be robust to datasets like these?

paulbrodersen commented 3 years ago

Hi, thanks for raising the issue. The entropy estimator is designed for continuous variables, and as a result it assumes that all values in your data are unique (as the probability of encountering the same value multiple times in samples from truly continuous variables is infinitesimally small). However, you seem to have at least one value in your data that is repeated multiple times. As a result, there are k-nearest neighbour distances in your data set that are zero. As the entropy is a function of the logarithm of these distances, you run into infinite values.

A common solution in the literature is to add a small amount of Gaussian noise to the data with a standard deviation that is on the order of your measurement precision. If the number of non-unique values is very small, then you may also want to simply drop them.
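
In code, the two workarounds look roughly like this; it's only a sketch, and the noise scale below is a placeholder that you would replace with your actual measurement precision:

import numpy as np
from entropy_estimators import continuous

# data1 is assumed to be a 1-D numpy array, e.g. as loaded in your snippet above.

# Option 1: if only a handful of values are repeated, just drop the duplicates.
deduplicated = np.unique(data1)
print(continuous.get_h(deduplicated, k=5))

# Option 2: jitter the data with Gaussian noise on the order of the
# measurement precision so that all values become distinct.
measurement_precision = 1e-5  # placeholder -- set this to your instrument's precision
jittered = data1 + np.random.normal(0.0, measurement_precision, size=data1.shape)
print(continuous.get_h(jittered, k=5))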

xanderdunn commented 3 years ago

Ah, this makes sense, thank you very much.

Indeed, there are 235/5000 data points that are duplicates in test_set_1, and there are 6/2382 data points in test_set_2 that are duplicates.

Here is a third dataset: test_set_3.csv.zip. 95/2383 of its data points are duplicates, but we don't get a -inf entropy estimate for this one: continuous.get_h(data_3, k=5) produces an entropy of 5.907636996354834.

Do we think this is because there is some chance that the duplicate data points won't end up among the k=5 nearest neighbours used by the estimator?

paulbrodersen commented 3 years ago

Well, the estimator is only looking at the kth distance, i.e. here the fifth distance, and hence the distances to the neighbours 1-4 can be zero without causing any issues. I would assume that in the third data set none of your values are repeated more than 4 times.
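
If you want to check that hypothesis directly, something like this (plain numpy, nothing specific to this package) counts how often the most common value occurs; -inf should only appear when some value occurs more than k times, because only then is even the kth-neighbour distance zero:

import numpy as np

def max_repeats(x):
    # Highest number of occurrences of any single value in x.
    _, counts = np.unique(x, return_counts=True)
    return counts.max()

# e.g. max_repeats(data_3) <= 5 would be consistent with the finite estimate you got for k=5.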

paulbrodersen commented 3 years ago

FYI, I had forgotten this but the last time I ran into this issue, I implemented a min_dist argument in get_h that allows you to cap distances to a minimum value. In hindsight, I probably should not have set the default value to 0.
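
So something along these lines should already avoid the -inf (the floor value 1e-10 is just an example; pick something on the order of your measurement precision):

# Cap the nearest-neighbour distances at a small positive value so that log(0) never occurs.
h = continuous.get_h(data1, k=5, min_dist=1e-10)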

xanderdunn commented 3 years ago

Ah, that helped, thanks! I still had one dataset that was very slow for some reason when I called get_h on it, even with min_dist set: it took 8 minutes to compute the entropy, whereas other datasets of the same size take seconds. 10332/22999 of its data points were non-unique, so it may be related to the zero-distances issue, but the performance problem persisted even after setting min_dist.

Introducing noise in the same manner as it's done in the npeet entropy estimator here not only fixed the -inf result, it also solved the performance problem on that particular dataset, taking it from 8 minutes down to a couple of seconds.
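
For reference, the jitter I ended up adding looks roughly like npeet's add_noise helper; the uniform noise and the intens value are my approximation of what that library does, so treat the details as illustrative:

import numpy as np

def add_noise(x, intens=1e-10):
    # Tiny uniform jitter, just enough to break exact ties between repeated values.
    return x + intens * np.random.random_sample(x.shape)

h = continuous.get_h(add_noise(data1), k=5)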

Thanks for your help, I have a better understanding of these methods now!

paulbrodersen commented 3 years ago

Always a pleasure when you get such a good issue report.