Closed: vmand4 closed this issue 3 years ago
Hi @vmand4
Sorry for the late reply -- I was on holiday.
What is "n" here?
If "n" is the dimensionality of your data, then the maximum possible entropy is proportional to n (it is the log of the size of the state space), and normalizing by that maximum makes some sense if you are comparing apples to oranges -- for example, a collection of binary vectors of length 3 with a collection of binary vectors of length 4. If your continuous data is sampled from bounded domains of varying size, then you could normalize by the entropies of the uniform distributions on those domains. If your continuous data is sampled from unbounded domains, the maximum entropy is infinite, and such a normalization is impossible.
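To make the bounded case concrete, here is a rough sketch (the function name and the box bounds `lower`/`upper` are placeholders of my own choosing, and the entropy estimate `h` is assumed to be in nats):

```python
import numpy as np

def normalize_by_uniform_entropy(h, lower, upper):
    """Normalize a differential entropy estimate `h` (in nats) by the entropy
    of the uniform distribution on the axis-aligned box with per-dimension
    `lower` and `upper` bounds, i.e. by the maximum entropy achievable on
    that bounded domain."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # entropy of the uniform distribution = log of the box volume
    h_max = np.sum(np.log(upper - lower))
    return h / h_max
```

Keep in mind that differential entropies can be negative, so while the ratio is at most one (for positive h_max), it is not guaranteed to lie in [0, 1].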
If "n" is the total number of samples, then this normalization does not make much sense to me. Entropies are computed from probabilities (or some proxy thereof). Your total probability mass should always sum to one, so in theory you should not need to normalize by any measure of the number of samples -- neither in the discrete nor in the continuous case. This is not quite true in the continuous case, as the K-L estimator has a bias that depends on the total number of samples. However, that bias does not scale with 1/log(N). People have tried to estimate the bias empirically by subsampling their data, computing the entropy for successively smaller samples, and then extrapolating to infinite sample size. If you have a lot of samples to begin with, that may work for you.
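If you want to try that subsampling/extrapolation route, here is a rough sketch of the idea (the linear fit of H against 1/N is just one common choice of extrapolation, and `entropy_func` stands in for whatever estimator you are using):

```python
import numpy as np

def extrapolated_entropy(x, entropy_func, fractions=(1.0, 0.8, 0.6, 0.4),
                         repeats=10, seed=0):
    """Estimate the large-sample limit of an entropy estimator by computing it
    on successively smaller random subsamples and extrapolating to infinite
    sample size (here via a linear fit of H against 1/N)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    inv_n, estimates = [], []
    for fraction in fractions:
        n = int(round(fraction * len(x)))
        for _ in range(repeats):
            subsample = x[rng.choice(len(x), size=n, replace=False)]
            inv_n.append(1.0 / n)
            estimates.append(entropy_func(subsample))
    # extrapolate to 1/N -> 0: the intercept of the linear fit
    slope, intercept = np.polyfit(inv_n, estimates, 1)
    return intercept
```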
Hope that helps.
Best, Paul
I am closing this since it is not really an issue with the code base. Feel free to re-open if you have any remaining questions.
Thank you Paul for sharing your insights. Very helpful.
Hello Paul,
Awesome work. I am using your estimators to calculate entropy for continuous variables (get_h).
I am trying to normalize the entropy from get_h by the maximal entropy (log(n)) so that it falls on a scale between 0 and 1, but I am not sure whether this (dividing by log(n)) can also be done for the entropy of a continuous variable?
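Roughly, this is what I mean (just a sketch -- the exact import path and keyword arguments of get_h might differ slightly from what I show here):

```python
import numpy as np
from entropy_estimators import continuous

x = np.random.randn(1000, 1)     # samples of a continuous variable, shape (n, d)
n = len(x)

h = continuous.get_h(x, k=5)     # differential entropy estimate
h_normalized = h / np.log(n)     # <-- the step I am unsure about
```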