Hi, thanks for raising the issue. The entropy estimator is designed for continuous variables, and as a result it assumes that all values in your data are unique (as the probability of encountering the same value multiple times in samples from truly continuous variables is infinitesimally small). However, you seem to have at least one value in your data that is repeated multiple times. As a result, there are k-nearest neighbour distances in your data set that are zero. As the entropy is a function of the logarithm of these distances, you run into infinite values. A common solution in the literature is to add a small amount of Gaussian noise to the data with a standard deviation that is on the order of your measurement precision. If the number of non-unique values is very small, then you may also want to simply drop them.
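For concreteness, a minimal sketch of both steps (counting repeated values, then jittering), assuming a 1D numpy array and the usual `from entropy_estimators import continuous` import; the file name and the `1e-6` noise scale are only placeholders, not anything the library prescribes:

```python
import numpy as np
from entropy_estimators import continuous

# Load the sample (placeholder path; adjust to your data).
data = np.loadtxt("test_set_1.csv")

# Count how many data points share their value with at least one other point.
_, counts = np.unique(data, return_counts=True)
n_non_unique = int(counts[counts > 1].sum())
print(f"{n_non_unique} of {data.size} values are non-unique")

# Add Gaussian noise with a standard deviation on the order of the
# measurement precision (1e-6 here is just an illustrative choice).
rng = np.random.default_rng(0)
jittered = data + rng.normal(scale=1e-6, size=data.shape)

# Re-estimate the entropy on the jittered data, mirroring the call
# pattern used elsewhere in this thread.
h = continuous.get_h(jittered, k=5)
```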
Ah, this makes sense, thank you very much.
Indeed, 235/5000 data points in test_set_1 are duplicates, and 6/2382 data points in test_set_2 are duplicates.
Here is a third dataset: test_set_3.csv.zip

95/2383 of those data points are the same, but we don't get a `-inf` entropy estimate for this one: `continuous.get_h(data_3, k=5)` produces an entropy of 5.907636996354834. Do we think this is because there is some chance that the equal data points won't be chosen in the `k=5` nearest-neighbour distance estimator?
Well, the estimator is only looking at the kth distance, i.e. here the fifth distance, and hence the distances to the neighbours 1-4 can be zero without causing any issues. I would assume that in the third data set none of your values are repeated more than 4 times.
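A quick way to check this (a sketch, assuming `data_3` is the numpy array loaded from test_set_3.csv):

```python
import numpy as np

# If no value occurs more than k times, every point still has a strictly
# positive distance to its k-th nearest neighbour, so the estimate stays finite.
_, counts = np.unique(data_3, return_counts=True)
print(f"largest multiplicity: {counts.max()} (k = 5)")
```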
FYI, I had forgotten this, but the last time I ran into this issue I implemented a `min_dist` argument in `get_h` that lets you clip distances to a minimum value. In hindsight, I probably should not have set the default value to 0.
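Usage looks like this (the `1e-10` floor is only an illustrative value; pick something well below your measurement precision):

```python
from entropy_estimators import continuous

# Distances smaller than min_dist are clipped up to min_dist,
# so the logarithm can no longer blow up to -inf.
h = continuous.get_h(data, k=5, min_dist=1e-10)
```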
Ah, that helped, thanks! I still had one dataset that was surprisingly slow when I called `get_h` on it, even with `min_dist` set: it took 8 minutes to calculate the entropy, whereas other datasets of the same size take seconds. 10332/22999 data points were non-unique, so it may be related to the zero-distance issue, but the performance problem persisted even after setting a `min_dist`.
Introducing noise in the same manner as it's done in the npeet entropy estimator here not only solved the `-inf` result, but also solved the performance problem on that particular dataset, taking it from 8 minutes to a couple of seconds.
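Roughly, what I did before calling `get_h` (a sketch; the `1e-10` intensity is just the small scale I used, well below the resolution of my data, not necessarily npeet's exact default):

```python
import numpy as np
from entropy_estimators import continuous

# Break exact ties with a tiny uniform jitter before estimating the entropy.
rng = np.random.default_rng(0)
noisy = data + 1e-10 * rng.random(data.shape)
h = continuous.get_h(noisy, k=5)
```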
Thanks for your help, I have a better understanding of these methods now!
Always a pleasure when you get such a good issue report.
Thanks for providing this open-source implementation of these entropy estimators.
Here are two datasets in .csv format that are unexpectedly giving me `-inf` entropy estimations: Datasets.zip

Plotting the datasets shows that they are pretty reasonable. This is test_dataset_1 (a sine wave with some added Gaussian noise):
This is test_dataset_2:
When I estimate their entropies with `continuous.get_h`, I get `-inf` for both.
I haven't found any commonality between the two that explains it. data_set_2 has values that are 0.0, but data_set_1 doesn't, although data_set_1 does have values as small as `1e-5`. Surely we expect the K-L entropy estimation technique to be robust to datasets such as these?
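For reference, by K-L I mean the Kozachenko-Leonenko k-nearest-neighbour estimator, which as I understand it is roughly

$$\hat{H} = \psi(N) - \psi(k) + \log V_d + \frac{d}{N} \sum_{i=1}^{N} \log \varepsilon_i(k),$$

where $\varepsilon_i(k)$ is (twice) the distance from point $i$ to its $k$-th nearest neighbour, $\psi$ is the digamma function, and $V_d$ is the volume of the $d$-dimensional unit ball.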