The origin of the problem is located here:
https://github.com/miicTeam/miic_R_package/blob/d6daff69fcd26b64638c17f431b97243e48277b5/src/environment.cpp#L37-L44
While the reported behavior comes from the condition `levels[i/j] == n_samples`, where `n_samples` takes NA values into account but `levels[i/j]` does not, I doubt whether the whole block is correct at all. I don't see why the mutual information should be zero in such cases.
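To make the mismatch concrete, here is a minimal standalone sketch (the NA encoding and helper names are assumed for illustration; this is not the actual code from environment.cpp):

```cpp
#include <iostream>
#include <set>
#include <vector>

constexpr int kNA = -1;  // assumed encoding of a missing value

// Number of distinct observed (non-NA) values, i.e. what levels[i] would hold.
int count_levels(const std::vector<int>& x) {
  std::set<int> seen;
  for (int v : x)
    if (v != kNA) seen.insert(v);
  return static_cast<int>(seen.size());
}

// Number of samples where the variable is actually observed.
int count_non_na(const std::vector<int>& x) {
  int n = 0;
  for (int v : x)
    if (v != kNA) ++n;
  return n;
}

int main() {
  const int n_samples = 5;  // raw row count, NAs included

  std::vector<int> x_complete{0, 1, 2, 3, 4};   // all values unique, no NA
  std::vector<int> x_with_na{0, 1, 2, 3, kNA};  // all observed values unique, one NA

  // Check as currently written: triggers only for the complete variable.
  std::cout << (count_levels(x_complete) == n_samples) << "\n";  // 1
  std::cout << (count_levels(x_with_na)  == n_samples) << "\n";  // 0 (4 != 5)

  // A NA-aware check compares levels to the non-missing count instead.
  std::cout << (count_levels(x_with_na) == count_non_na(x_with_na)) << "\n";  // 1
  return 0;
}
```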
You're right, the mutual information $I(X;Y)$ is not zero when $|X| = N$, but equal to $H(Y)$.
The intuition is that if every sample of $X$ is unique, then we already have maximum entropy and there cannot be any interaction component with $Y$, thus $H(X) = H(X,Y)$.
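Equivalently, plugging $H(X) = H(X,Y)$ into the identity

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

directly gives $I(X;Y) = H(Y)$.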
If we take the formula for the mutual information:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

we see that for each level of $x$, $p(x,y)$ is either $0$ or $p(x)$, which is $1/N$. If we plug that in, we eventually get the formula for $H(Y)$:

$$I(X;Y) = \sum_{x,y:\,p(x,y)=1/N} \frac{1}{N} \log \frac{1}{p(y)} = \sum_{y} N p(y) \cdot \frac{1}{N} \log \frac{1}{p(y)} = -\sum_{y} p(y) \log p(y) = H(Y).$$
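As a quick sanity check, here is a small standalone sketch (plug-in estimates computed directly from counts; it does not use the miic code) showing that the empirical $I(X;Y)$ equals $H(Y)$ when every value of $X$ is unique:

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <map>
#include <vector>

// Empirical (plug-in) entropy of a discrete variable, in nats.
double entropy(const std::vector<int>& v) {
  std::map<int, int> counts;
  for (int x : v) ++counts[x];
  const double n = static_cast<double>(v.size());
  double h = 0.0;
  for (const auto& kv : counts) {
    const double p = kv.second / n;
    h -= p * std::log(p);
  }
  return h;
}

// Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y).
double mutual_information(const std::vector<int>& x, const std::vector<int>& y) {
  std::vector<int> joint(x.size());
  // Encode the joint level (x, y) as a single integer; levels are assumed small.
  for (std::size_t i = 0; i < x.size(); ++i) joint[i] = x[i] * 1000 + y[i];
  return entropy(x) + entropy(y) - entropy(joint);
}

int main() {
  // X has one unique value per sample (|X| = N), Y is an ordinary binary variable.
  std::vector<int> x{0, 1, 2, 3, 4, 5, 6, 7};
  std::vector<int> y{0, 0, 1, 1, 1, 0, 1, 0};

  std::cout << "I(X;Y) = " << mutual_information(x, y) << "\n";
  std::cout << "H(Y)   = " << entropy(y) << "\n";  // equal to I(X;Y), not zero
  return 0;
}
```

For $N = 8$ samples and a balanced binary $Y$, both lines print $\log 2 \approx 0.693$ nats.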
The point here is that there is no interaction between $X$ and $Y$; in practice the edge would be deleted by the complexity term, but we take a shortcut and remove the edge immediately.
When excluding variables having no information, the processing is not the same depending on whether NAs are present or not. A non-continuous variable whose values are all different is considered as not informative (no edge in miic's init), whilst a similar variable with one or more NAs is considered as informative.
Here is an example dataset to illustrate this scenario: