paulbrodersen / entropy_estimators

Estimators for the entropy and other information theoretic quantities of continuous distributions
GNU General Public License v3.0

Process finished with exit code -1073741571 (0xC00000FD) #6

Closed zhangyue233 closed 3 years ago

zhangyue233 commented 3 years ago

When I run get_h(), the program exits automatically and prints "Process finished with exit code -1073741571 (0xC00000FD)".

paulbrodersen commented 3 years ago

I am skeptical that this issue has anything to do with my code. Per this list, the error code is a Windows notification of a stack overflow. get_h does not compute the entropy recursively, so I can't see how it would cause a stack overflow. Can you provide a minimal, reproducible example that produces the error in a clean virtual environment (i.e. one containing only the dependencies of this module)?

Also, if you google around, a lot of people seem to run into this error code when using PyCharm, especially in combination with Qt and/or TensorFlow (or packages built on top of TensorFlow, such as Keras). Are you using any of these programs/packages?

paulbrodersen commented 3 years ago

To be clear, I am not completely ruling out that my code is at fault; I am just saying that I need a lot more evidence to convince me.

zhangyue233 commented 3 years ago

Yes, I figured out why get_h doesn't work well. The function contains the line sum_log_dist = np.sum(log(2*distances)). In my data, some samples have identical values, which causes some elements of distances to be zero; sum_log_dist then evaluates to -inf, and the subsequent code runs into an error.
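A minimal snippet (not part of the package, just an illustration) that reproduces the mechanism: with duplicate samples, the nearest-neighbour distance is exactly zero, and log(0) gives -inf.

import numpy as np
from scipy.spatial import cKDTree

# Two identical samples: their nearest-neighbour distance is exactly zero.
x = np.array([[0.0], [0.0], [1.0]])
distances, _ = cKDTree(x).query(x, k=2)
nn_distances = distances[:, 1]  # skip each point's zero distance to itself -> [0., 0., 1.]

with np.errstate(divide='ignore'):
    sum_log_dist = np.sum(np.log(2 * nn_distances))  # log(0) = -inf poisons the sum
print(sum_log_dist)  # -inf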

zhangyue233 commented 3 years ago

So far, I have no idea how to handle this situation; I've decided to read up on the original paper. Could you give me some advice? Thanks!

paulbrodersen commented 3 years ago

get_h has a min_dist parameter, which when set to non-zero should circumvent your issue (distances between points smaller than min_dist are capped to min_dist such that points with the same coordinates are forced to have non-zero distances to each other). A principled choice for min_dist is half of your measurement precision, typically the minimum non-zero nearest-neighbour distance in your dataset.
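In code, that recipe might look roughly as follows (a sketch; the import path is an assumption based on this repository's layout, while the k and min_dist keyword arguments are the ones discussed in this thread):

import numpy as np
from scipy.spatial import cKDTree
from entropy_estimators import continuous  # import path assumed

x = np.round(np.random.normal(size=(1000, 1)), 2)  # rounding creates duplicate samples

# Proxy the measurement precision by the minimum non-zero
# nearest-neighbour distance, and use half of it as min_dist.
distances, _ = cKDTree(x).query(x, k=2)
nn_distances = distances[:, 1]
min_dist = nn_distances[nn_distances > 0].min() / 2.

h = continuous.get_h(x, k=5, min_dist=min_dist)
print(h)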

zhangyue233 commented 3 years ago

Thanks for your guidance. While testing get_h and get_h_mvn on my data for feature selection, I found that get_h_mvn works ideally: the calculated entropy values are consistent with my intuition about the feature data. In particular, one feature is in fact discrete, and the entropy calculated by get_h_mvn is close to the entropy given by the standard information entropy equation for a discrete variable.

However, get_h performs awfully. First, the entropy values it calculates are counterintuitive, and when I rank the features by entropy, the rankings from get_h and get_h_mvn differ greatly. Second, for one feature whose values are composed of {0.0: 7950, 0.0003636: 1, 0.0263157: 1}, get_h_mvn is stuck at the line kdtree = cKDTree(x); Python stops and prints "Process finished with exit code -1073741571 (0xC00000FD)".

zhangyue233 commented 3 years ago

I'm working on feature selection for my project. In my situation, the features are used for clustering, and there is little labeled data. I have tried information entropy and the Laplacian score as feature filters. Do you have experience with feature selection for this scenario?

paulbrodersen commented 3 years ago

the entropy calculated by get_h_mvn is close to the entropy given by the standard information entropy equation for a discrete variable

That could be entirely accidental.

get_h_mvn is stuck at the line kdtree = cKDTree(x); Python stops and prints "Process finished with exit code -1073741571 (0xC00000FD)"

There is no call to cKDTree in get_h_mvn. It uses the (co-)variance to compute the entropy under the assumption that the samples are drawn from a multivariate normal distribution. Are you sure you are calling get_h_mvn?
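For reference, under that assumption the entropy has a closed form, H = 0.5 * log((2*pi*e)^d * det(Sigma)), which can be evaluated with plain numpy (a sketch independent of this package):

import numpy as np

def mvn_entropy(x):
    # x : (n, d) array of samples.
    # Entropy (in nats) of a multivariate normal fitted to the samples:
    # H = 0.5 * log((2*pi*e)**d * det(Sigma))
    n, d = x.shape
    sigma = np.cov(x, rowvar=False).reshape(d, d)
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(sigma))

x = np.random.normal(scale=2.0, size=(10000, 1))
print(mvn_entropy(x))  # ~ 0.5 * log(2*pi*e*4) ≈ 2.11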

zhangyue233 commented 3 years ago

Oh, no, sorry: get_h_mvn works well; it is get_h that gets stuck at cKDTree.

paulbrodersen commented 3 years ago

Ok. How many samples are in your dataset and what values are you using for k and min_dist?

zhangyue233 commented 3 years ago

I used the default k and selected min_dist as you advised, i.e. the minimum non-zero distance. In fact, these have nothing to do with the aforementioned error; you can run the following code with Python 3.6 and the error will reappear:

import numpy as np
from scipy.spatial import cKDTree
x = [[i] for i in [0]*7950 + [0.0003636, 0.0263157]]
tree = cKDTree(np.array(x))

Process finished with exit code -1073741571 (0xC00000FD)

zhangyue233 commented 3 years ago

If there are both continuous and discrete features in my feature set, I need to rank their information entropies for feature filtering. Is it justified to treat all features as continuous variables and evaluate their entropies with get_h_mvn()? Or should the continuous features' entropies be computed with get_h_mvn() while the discrete ones are calculated with the Shannon entropy equation, and then all the entropies ranked together? Looking forward to your guidance, thanks!

paulbrodersen commented 3 years ago

I can't reproduce your error. If I were you, I would investigate your setup and potentially file a bug report with scipy.

In [1]: %paste
import numpy as np
from scipy.spatial import cKDTree
x = [[i] for i in [0]*7950 + [0.0003636, 0.0263157]]
tree = cKDTree(np.array(x))
## -- End pasted text --
In [2]: tree
Out[2]: <scipy.spatial.ckdtree.cKDTree at 0x7f427495f4a8>

Entropy is an extensive property. So no, I don't think that you can compare entropy values for discrete variables with entropy values for continuous variables: differential entropy, unlike discrete Shannon entropy, depends on the scale (units) of the variable and can even be negative. Even within your continuous features such a comparison may be nonsensical.
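To see why the comparison is fragile even among continuous features, consider the scaling behaviour of differential entropy: rescaling a variable by a factor a shifts its entropy by log|a|, so the ranking depends on each feature's units. A quick check using the Gaussian closed form (again independent of this package):

import numpy as np

def normal_entropy(x):
    # Differential entropy (in nats) of a 1-d normal fitted to the samples.
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

x = np.random.normal(size=100000)
print(normal_entropy(x))       # ≈ 1.42
print(normal_entropy(10 * x))  # ≈ 1.42 + log(10) ≈ 3.72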

paulbrodersen commented 3 years ago

Since I haven't heard from you for a week, I will close this issue for now. Feel free to re-open if necessary.