miicTeam / miic_R_package

Learning causal or non-causal graphical models using information theory
GNU General Public License v3.0

Variables having no information are kept when NAs are present #91

Closed franck-simon closed 3 years ago

franck-simon commented 3 years ago

When variables carrying no information are excluded, the processing differs depending on whether NAs are present. A non-continuous variable whose values are all different is considered non-informative (no edge in miic's init), whereas a similar variable with one or more NAs is considered informative.

Here is an example dataset to illustrate this scenario:

image

honghaoli42 commented 3 years ago

The origin of the problem is here: https://github.com/miicTeam/miic_R_package/blob/d6daff69fcd26b64638c17f431b97243e48277b5/src/environment.cpp#L37-L44 The reported behavior comes from the condition `levels[i/j] == n_samples`, where `n_samples` takes NA values into account but `levels[i/j]` does not. That said, I doubt the whole block is correct at all: I don't see why the mutual information would be zero in such cases.
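To make the mismatch concrete, here is a minimal Python sketch of the comparison described above. It is not the code from `environment.cpp`: the function names are hypothetical, and `None` stands in for NA. It only illustrates why comparing a level count that ignores NAs against a sample count that includes them lets an all-unique variable slip through once a value is missing.

```python
def count_levels(values):
    """Number of distinct non-missing values (None stands in for NA)."""
    return len({v for v in values if v is not None})


def all_unique_naive(values):
    # Hypothetical version of the reported check: the level count ignores
    # NAs, but the sample count it is compared against includes them.
    return count_levels(values) == len(values)


def all_unique_na_aware(values):
    # Hypothetical alternative: compare against the non-missing rows only.
    n_observed = sum(v is not None for v in values)
    return count_levels(values) == n_observed


complete = ["a", "b", "c", "d"]
with_na = ["a", "b", None, "d"]

print(all_unique_naive(complete))    # True: flagged as all-unique
print(all_unique_naive(with_na))     # False: 3 levels vs 4 samples
print(all_unique_na_aware(with_na))  # True: 3 levels vs 3 observed samples
```

Whether the NA-aware comparison is the right fix is a separate question; the sketch only reproduces the inconsistency.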

vcabeli commented 3 years ago

You're right, the mutual information I(X;Y) is not null when |X|=N, but equal to H(Y).

The intuition is that if every sample of X is unique then we already have maximum entropy and there cannot be any interaction component with Y, thus H(X) = H(X,Y).

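If we take the formula for the mutual information:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

We see that for each level of $x$, $p(x,y)$ is either $0$ or $p(x)$, which is $1/N$. If we plug that in, each of the $N$ samples contributes $-\frac{1}{N} \log p(y)$, and grouping the samples by their value of $y$ eventually gives the formula for $H(Y)$:

$$I(X;Y) = -\sum_{y} p(y) \log p(y) = H(Y)$$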
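The claim that I(X;Y) = H(Y) when every value of X is unique can be checked numerically. The following is a small sketch using plug-in (empirical frequency) estimates; the helper names and the toy data are assumptions made for illustration and are not taken from miic.

```python
from collections import Counter
from math import isclose, log


def entropy(samples):
    """Plug-in entropy (in nats) of a list of discrete samples."""
    n = len(samples)
    return -sum((c / n) * log(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete samples."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )


# Toy data: X takes a different value in every sample, so |X| = N.
xs = list(range(8))
ys = ["a", "a", "a", "b", "b", "c", "c", "c"]

# I(X;Y) matches H(Y), and H(X) matches H(X,Y), up to rounding.
print(isclose(mutual_information(xs, ys), entropy(ys)))
print(isclose(entropy(xs), entropy(list(zip(xs, ys)))))
```

Both comparisons use `isclose` because the two sides are computed through different floating-point sums.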

The point here is that there is no interaction between X and Y. In practice the edge would be deleted by the complexity term anyway, but we take a shortcut and remove the edge immediately.