The origin of the problem is located here:
https://github.com/miicTeam/miic_R_package/blob/d6daff69fcd26b64638c17f431b97243e48277b5/src/environment.cpp#L37-L44
While the reported behavior comes from the condition `levels[i/j] == n_samples`, where `n_samples` takes NA values into account but `levels[i/j]` does not, I doubt whether the whole block is correct at all. I don't see why the mutual information should be zero in such cases.
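To make the mismatch concrete, here is a minimal standalone sketch (the NA encoding and helper names are assumed for illustration; this is not the actual code from environment.cpp):

```cpp
#include <iostream>
#include <set>
#include <vector>

constexpr int kNA = -1;  // assumed encoding of a missing value

// Number of distinct observed (non-NA) values, i.e. what levels[i] would hold.
int count_levels(const std::vector<int>& x) {
  std::set<int> seen;
  for (int v : x)
    if (v != kNA) seen.insert(v);
  return static_cast<int>(seen.size());
}

// Number of samples where the variable is actually observed.
int count_non_na(const std::vector<int>& x) {
  int n = 0;
  for (int v : x)
    if (v != kNA) ++n;
  return n;
}

int main() {
  const int n_samples = 5;  // raw row count, NAs included

  std::vector<int> x_complete{0, 1, 2, 3, 4};   // all values unique, no NA
  std::vector<int> x_with_na{0, 1, 2, 3, kNA};  // all observed values unique, one NA

  // Check as currently written: triggers only for the complete variable.
  std::cout << (count_levels(x_complete) == n_samples) << "\n";  // 1
  std::cout << (count_levels(x_with_na)  == n_samples) << "\n";  // 0 (4 != 5)

  // A NA-aware check compares levels to the non-missing count instead.
  std::cout << (count_levels(x_with_na) == count_non_na(x_with_na)) << "\n";  // 1
  return 0;
}
```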
You're right, the mutual information $I(X;Y)$ is not zero when $|X| = N$, but equal to $H(Y)$.
The intuition is that if every sample of $X$ is unique, then we already have maximum entropy and there cannot be any interaction component with $Y$, thus $H(X) = H(X,Y)$.
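Equivalently, plugging $H(X) = H(X,Y)$ into the identity

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

directly gives $I(X;Y) = H(Y)$.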
If we take the formula for the mutual information:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

we see that for each level of $x$, $p(x,y)$ is either $0$ or $p(x)$, which is $1/N$. If we plug that in, we eventually get the formula for $H(Y)$:

$$I(X;Y) = \sum_{x,y:\,p(x,y)=1/N} \frac{1}{N} \log \frac{1}{p(y)} = \sum_{y} N p(y) \cdot \frac{1}{N} \log \frac{1}{p(y)} = -\sum_{y} p(y) \log p(y) = H(Y).$$
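As a quick sanity check, here is a small standalone sketch (plug-in estimates computed directly from counts; it does not use the miic code) showing that the empirical $I(X;Y)$ equals $H(Y)$ when every value of $X$ is unique:

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <map>
#include <vector>

// Empirical (plug-in) entropy of a discrete variable, in nats.
double entropy(const std::vector<int>& v) {
  std::map<int, int> counts;
  for (int x : v) ++counts[x];
  const double n = static_cast<double>(v.size());
  double h = 0.0;
  for (const auto& kv : counts) {
    const double p = kv.second / n;
    h -= p * std::log(p);
  }
  return h;
}

// Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y).
double mutual_information(const std::vector<int>& x, const std::vector<int>& y) {
  std::vector<int> joint(x.size());
  // Encode the joint level (x, y) as a single integer; levels are assumed small.
  for (std::size_t i = 0; i < x.size(); ++i) joint[i] = x[i] * 1000 + y[i];
  return entropy(x) + entropy(y) - entropy(joint);
}

int main() {
  // X has one unique value per sample (|X| = N), Y is an ordinary binary variable.
  std::vector<int> x{0, 1, 2, 3, 4, 5, 6, 7};
  std::vector<int> y{0, 0, 1, 1, 1, 0, 1, 0};

  std::cout << "I(X;Y) = " << mutual_information(x, y) << "\n";
  std::cout << "H(Y)   = " << entropy(y) << "\n";  // equal to I(X;Y), not zero
  return 0;
}
```

For $N = 8$ samples and a balanced binary $Y$, both lines print $\log 2 \approx 0.693$ nats.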
The point here is that there is no interaction between $X$ and $Y$; in practice the edge would be deleted by the complexity term, but we take a shortcut and remove the edge immediately.
When excluding variables having no information, the processing is not the same depending on whether NAs are present or not. A non-continuous variable whose values are all different is considered as not informative (no edge in miic's init), whilst a similar variable with one or more NAs is considered as informative.
Here is an example dataset to illustrate this scenario: