Entropy is a property of a distribution. A single sample is -- by definition -- not a distribution, so computing the entropy of a sample is nonsensical.
If you have multiple (M) distributions with N samples each saved in the form of a M-by-N array, then you can compute the entropy of each distribution using a simple for loop:
```
import numpy as np
from continuous import get_h

m = 10
n = 1000
data = np.random.randn(m, n)

entropies = np.zeros((m))
for ii, samples in enumerate(data):
    entropies[ii] = get_h(samples[:, np.newaxis], k=5)
```
Hope that helps.
Feel free to re-open if this does not answer your question.
Thank you so much for explaining in detail. Actually, I meant that I have an M by N matrix. I do want to apply entropy to each row separately, but each row itself consists of N samples (N columns).
With this explanation, do you still think that entropy might not make sense? I appreciate your help.
Are these actually N distinct measurements of the same quantity (i.e. do all values have the same units), or are these different properties of 1 sample?
It would help if you explained in detail what the data is, where the data is coming from, and what question you want to answer.
Actually, they are N distinct measurements of the same quantity (yes, all values have the same units).
The data is a large matrix coming from a layer in a neural network. M stands for M different features, and N indicates which cluster each feature belongs to.
To give more detail, I am working on an autoencoder model. In the middle layer I have one matrix, in which the rows are my features, say (computer, internet, business, windows, deal, price, keyboard), and the columns indicate which cluster these features belong to. So here we will have 7 rows and, let's assume, 2 columns. My matrix will look like this:
```
computer  [[0.9  , 0.1  ],
internet   [0.8  , 0.09 ],
windows    [0.76 , 0.004],
price      [0.009, 0.1  ],
business   [0.02 , 0.34 ],
keyboard   [0.3  , 0.5  ],
deal       [0.004, 0.76 ]]
```
So here, I will calculate entropy on the first row, which has two values: the first value is 0.9 (column 1), and the second value for the same feature is 0.1 (column 2). Here 0.9 is large compared with 0.1, which means this feature belongs to the first cluster/column.
I appreciate you putting time into this.
Thanks~
Let me play back to you what I think you want to do to make sure I understand you correctly:
If that is what you have in mind: neat idea.
Actually, that is exactly what I want to do, except that I want to look at the reverse of entropy for each feature, to see how concentrated each feature is relative to the clusters (I am not interested in features whose distribution is uniform). I should also mention that I am not sure whether doing this with the weights in a neural network makes any sense at all; assume that the range of the weights will always be between -1 and 1. I did this already with binning, but it was hard to come up with a good split for the bins, so I wanted to try this with KNN, which, based on the code, I could not figure out how to do (it wanted to apply entropy over several rows). I would highly appreciate it if you could also share your view on this with me.
In principle, using a k-nearest neighbour approach should be fine, certainly if you are only interested in ranking your nodes from highest to lowest entropy.
The only thing that slightly worries me is the finite domain between -1 and 1. Intuitively, points at the extrema can only have neighbours on one side. Hence I feel like the k-nearest neighbour distances of extrema might be biased to be larger, which would be reflected in an underestimate of their entropy. I am not completely sure myself about this argument. As long as you are primarily interested in a ranking, and as long as you don't need a precise, unbiased estimate of the entropy in bits/nats, everything should be fine.
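If you want to check this concern empirically, here is a minimal sketch (using get_h in the same way as in the loop above): estimate the entropy of uniform samples on [-1, 1], whose true differential entropy is ln(2) nats, and see whether the estimate deviates systematically.

```
# Sanity check for the boundary concern: estimate the entropy of U(-1, 1),
# whose true differential entropy is ln(2) nats, and compare.
import numpy as np
from continuous import get_h

samples = np.random.uniform(-1., 1., size=10000)
h_estimate = get_h(samples[:, np.newaxis], k=5)
print(h_estimate, np.log(2.))
```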
Thank you so much for spending time and sharing your view with me. I really appreciate it.
So, in terms of the implementation, do you have any idea how I can implement it in tensorflow? I know I can map the function over each row of the tensor in this way:
```
ent_p = tf.map_fn(get_h, X, dtype=tf.float32)
```
This can be something like this:
```
def score(X):

    def get_h(x):
        k = 1
        x = tf.reshape(x, shape=[1, -1])
        norm = np.inf
        min_dist = 0.
        n = x.shape[0]
        d = x.shape[1]
        # manhattan distance
        distance = tf.reduce_sum(tf.abs(tf.subtract(x, tf.expand_dims(x, 1))), axis=2)
        # nearest k points
        _, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
        return top_k_indices

    ent_p = tf.map_fn(get_h, X, dtype=tf.float32)
    return ent_p
```
But it seems like something is wrong with this script. Am I missing something here?
Again many Thanks for your help~
I have never used tensorflow (I write my own stuff from scratch, as I am more interested in the fundamentals than in performance). However, as far as I can tell, you are just computing the nearest neighbours, not the entropy based on the distance to the nearest neighbours. Have a look at the source code for get_h in my implementation.
The core lines are:
```
sum_log_dist = np.sum(log(2*distances))
h = -digamma(k) + digamma(n) + log_c_d + (d / float(n)) * sum_log_dist
```
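If it helps to see those lines in context, here is a self-contained numpy/scipy sketch of the same k-nearest-neighbour (Kozachenko-Leonenko style) estimator, assuming the maximum norm so that the log_c_d volume term is zero; get_h in continuous.py remains the authoritative implementation.

```
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy(x, k=5):
    """Entropy (in nats) of samples x with shape (n, d), k-NN estimator, max norm."""
    n, d = x.shape
    tree = cKDTree(x)
    # distances to the k-th nearest neighbour of each point
    # (k + 1 because the closest "neighbour" of each point is the point itself)
    distances, _ = tree.query(x, k=k + 1, p=np.inf)
    kth_dist = distances[:, -1]
    log_c_d = 0.0  # log-volume of the unit ball under the maximum norm
    sum_log_dist = np.sum(np.log(2 * kth_dist))
    return -digamma(k) + digamma(n) + log_c_d + (d / float(n)) * sum_log_dist
```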
Yeah, actually the code above is not complete, as it raises an error when I run it. Thank you, I will figure it out. If I manage to get a tf implementation working, I will share it with you.
Thanks again for all your help.
I would appreciate it if you could answer this question of mine:
what do you conclude about a distribution which has a bigger entropy than another distribution over continuous data?
Fundamentally, entropy is a measure of dispersion or spread. For unimodal distributions, entropy is hence tightly coupled to the variance of the distribution. For example, for a Gaussian distribution the entropy is 0.5 + ln(sqrt(2*pi) * sigma), where sigma is the standard deviation of the distribution. So if one of your distributions has a higher entropy, its probability mass is more spread out.
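As a quick check of that formula, here is a sketch (assuming get_h from continuous, used in the same way as earlier in this thread): the estimate on Gaussian samples should approach the analytic value.

```
import numpy as np
from continuous import get_h

sigma = 2.0
samples = sigma * np.random.randn(10000)
h_estimate = get_h(samples[:, np.newaxis], k=5)
h_analytic = 0.5 + np.log(np.sqrt(2 * np.pi) * sigma)  # entropy of N(0, sigma^2) in nats
print(h_estimate, h_analytic)
```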
There are tons of good books about information theory. Cover and Thomas is the standard textbook but maybe a bit much to just get some intuitions. For that, Claude Shannon's original treatise is actually quite good.
Thank you so much for explaining in detail. Could you please share your thoughts on the following: I have followed an approach and got good results, though I failed to consider one of its assumptions (by mistake). Now I am wondering whether there is any way I can justify it.
So in the end, I computed the entropy using the gaussian method: 'gaussian' computes the binless entropy based on estimating the covariance matrix and assuming the data is normally distributed.
The thing is that my data is not normally distributed. My data is z in a neural network (z = tanh(w*x + b)).
Now my question is: is it totally wrong if we violate that assumption?
I am desperately looking for the answer to this question. I spent several months and designed what I wanted, and then I realized that my data does not follow a normal distribution, yet I have used the gaussian method for computing entropy.
I would greatly appreciate it if you could share your knowledge regarding this with me, and I am really sorry for taking your time.
Are you still using my code to compute the entropies? If so, which function are you using exactly?
If your data is non-normally distributed, you should not be using get_h_mvn. It will return garbage for non-normal distributions. The function get_h makes no assumption about the distribution of the data, so it is fine to use that one.
Actually, I could not figure out how to program the knn in tensorflow. That's why I used the gaussian method: that was the code I could convert to tensorflow.
This is the numpy version:
```
elif method == 'gaussian':
    from numpy.linalg import det
    if data is None:
        raise ValueError('Nearest neighbors entropy requires original data')
    detCov = det(np.dot(data.transpose(), data)/num_samples)
    normalization = (2*np.pi*np.exp(1))**num_dimensions
    if detCov == 0:
        return -np.inf
    else:
        if units == 'bits':
            return 0.5*np.log2(normalization*detCov)
        elif units == 'nats':
            return 0.5*np.log(normalization*detCov)
        else:
            print('Units not recognized: {}'.format(units))
```
And this is the tensorflow version:
```
def row_entropy(row):
    import tensorflow as tf
    if row is not None:
        data = tf.reshape(row, shape=[1, -1])
        # data = row
        num_samples = data.shape[0]
        if len(data.shape) == 1:
            num_dimensions = 1
        else:
            num_dimensions = data.shape[1]
        epsilon = tf.constant(0.000001)
        detCov = tf.linalg.det(
            tf.cast(tf.matmul(data, tf.transpose(data)), tf.float32) / tf.cast(num_samples, tf.float32))
        normalization = tf.pow(
            tf.cast((tf.multiply(2., tf.multiply(np.pi, tf.exp(1.0)))), tf.int32),
            num_dimensions)
        if detCov == 0:
            return -np.inf
        else:
            return (0.5 * tf.log(
                epsilon + tf.multiply(tf.cast(normalization, tf.float32), tf.cast(detCov, tf.float32))))
```
This is what I have done for calculating entropy!
I also figured out how to bin the values and then use that binning to compute entropy. But the problem with binning was that I could not figure out how to bin, for example, [0., 0.05, 0.7, 0.9] or [0., 0.04, 0.1, 0.5, 0.99].
I got good results with binning on two data sets, but one data set did not fit the binning I had.
So, based on my code with the gaussian method, do you think there is any way I can justify it? (I got good results on three datasets.)
I hope my explanations are not confusing, and again thanks for helping me out on this.
If your data is not normally distributed, you should not use the Gaussian approximation. It won't work, not even approximately. If the distribution of your values follows any of the distributions in the exponential family of probability distributions, there is a good chance that there is an analytic solution published somewhere. However, if your data comes from the activations of logistic or ReLU units in a neural net, chances are that those will not follow an exponential distribution (I would expect that logistic units have bi-modal activity distributions). Therefore you have to use the nearest neighbour approach.
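For reference, as far as I can tell the 'gaussian' branch in your snippet evaluates the closed-form differential entropy of a d-dimensional normal distribution with covariance Σ (estimated there as XᵀX/n, i.e. assuming zero-mean data):

```
h = \frac{1}{2} \log\!\left( (2 \pi e)^{d} \, \det \Sigma \right)
```

with the base-2 log giving bits and the natural log giving nats. That value is only meaningful if the data really is (approximately) normal.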
If you do use the nearest neighbour approach, never bin your data. Binning will put many points at zero distance to each other. At some stage when computing the entropy using nearest neighbour distances, we take the log of the distance. If that distance is zero, your entropy ends up being undefined.
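To make that concrete, here is a minimal standalone numpy sketch (not using the repository code):

```
import numpy as np

x = np.random.rand(1000)
binned = np.round(x, 1)  # snap values to a coarse grid of bin centres

# pairwise distances, with each point's distance to itself masked out
dist = np.abs(binned[:, None] - binned[None, :])
np.fill_diagonal(dist, np.inf)
nearest = dist.min(axis=1)  # nearest-neighbour distance of each point

# many points now share a bin, so some nearest-neighbour distances are exactly 0,
# and log(0) = -inf propagates into the entropy estimate
print(np.sum(np.log(nearest)))  # -inf (numpy also emits a divide-by-zero warning)
```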
Then why did I get good results on 3 datasets with gaussian even if my data is not normally distributed?
Do you think that if I go for knn the results will differ a lot?
I will definitely change the entropy to knn, but for now, do you think I can just say I have applied entropy to my data regardless of the distribution it follows and the results are good, or do you think the good results were just luck?
For now, I need a quick solution, and then I will change the entropy to knn.
The main reason I could not write the script for knn is that I was confused about how it really works, because we do not have a knn function in tensorflow, so I have to implement the idea behind it myself. Is it exactly like knn in ML? Like, it compares each instance with other instances and then groups them accordingly?
Thank you :)
> Then why did I get good results on 3 datasets with gaussian even if my data is not normally distributed?
I don't know how you measure goodness in your case.
> ...or do you think the good results were just luck?
Probably.
> The main reason I could not write the script for knn is that I was confused about how it really works, because we do not have a knn function in tensorflow, so I have to implement the idea behind it myself. Is it exactly like knn in ML? Like, it compares each instance with other instances and then groups them accordingly?
I strongly suggest you read the paper that I mentioned above, and then the code.
Sure, I will do that, and sorry for taking your time. I have already informed my advisor that I got good results, and then this happened! Which freaks me out.
By "good results" I mean: I designed a framework for topic modeling, and as part of the idea I am using entropy to choose among my features. With the proposed method I get very good results in terms of F1 score, precision, recall, and even document representation, even compared to 2019 papers. Now I don't know how to tell my advisor we cannot meet the deadline!!!
Thank you, by the way, you helped a lot.
I finally figured out a way to use knn in tensorflow, though as you might remember I needed to apply knn over each row of my matrix. In that case it seems knn does not work, so I converted each row to a 2d array of shape 1*n_dim. In this case the answer is always inf. So let's say I have a matrix like this:
```
[[0.65, 0.63, 0.22, 0.201],
 [0.3 , 0.51, 0.1 , 0.2  ],
 [0.2 , 0.32, 0.  , 0.50 ],
 [0.1 , 0.23, 0.37, 0.1  ]]
```
When feeding each row of this to the knn code, the answer is always inf.
Why is that?
I have a question here: how can I calculate continuous entropy using KNN over each row? As far as I understand the code, we apply it over the whole matrix, but what if I need to check the entropy of one sample of the data?
Do you have any idea how I can change the code to behave like this? And does it make sense at all?