gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations
Apache License 2.0

Low TC with columns of ones #29

Closed HHalva closed 2 years ago

HHalva commented 2 years ago

If I run CorEx with the following matrix as my data:

import numpy as np

X = np.array([[1,1,1,0,0],
              [1,1,1,1,1],
              [1,1,1,0,0],
              [1,1,1,1,1]], dtype=int)

The clusters come out as expected: [1 1 1 0 0]. But the total correlations are: [ 6.92147593e-01 -1.76592074e-12]

In particular, the latent variable capturing the columns of ones has a TC of essentially zero. I find this problematic because it suggests that there is really only one important latent factor (i.e. the one for cols 3 and 4 of X), which is clearly not the case. Is this expected behaviour? This is especially tricky if I want to write code that automatically selects the number of latent variables for this type of data, since, e.g., here it would have chosen n_hidden=1 even though n_hidden=2 seems the obvious solution.

HHalva commented 2 years ago

I suppose this is related to the correlation between vectors of ones being undefined, but I am not sure how to think about this type of situation. Perhaps I will just have to delete these constant variables (similar to the suggestion of deleting columns of zeros)? The downside is that these variables are then missing from the graphical and other outputs...

gregversteeg commented 2 years ago

Hi, yes, the best idea is to delete any constant columns. The TC for these columns is always zero. Because they are not correlated with anything, no latent factor is necessary to explain the correlation.
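A minimal sketch of that preprocessing step, using only NumPy. The helper name split_constant_columns is hypothetical (it is not part of bio_corex), and the sketch assumes the data has no missing values, since NaN entries would not compare equal to each other:

```python
import numpy as np

def split_constant_columns(X):
    """Return (X_varying, varying_idx, constant_idx) for a 2-D array X.

    Hypothetical helper, not part of bio_corex. Assumes no missing values.
    """
    X = np.asarray(X)
    # A column is constant when every row equals the value in the first row.
    is_constant = np.all(X == X[0, :], axis=0)
    varying_idx = np.flatnonzero(~is_constant)
    constant_idx = np.flatnonzero(is_constant)
    return X[:, varying_idx], varying_idx, constant_idx

X = np.array([[1, 1, 1, 0, 0],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 1, 1]], dtype=int)

X_varying, varying_idx, constant_idx = split_constant_columns(X)
print(varying_idx)   # [3 4]
print(constant_idx)  # [0 1 2]
```

Only X_varying would then be passed to CorEx. Keeping both index arrays makes it possible to relate the results back to the original column numbering.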

For discrete random variables, you can see this by looking at the Shannon entropy of your first column, H(X1) = 0. There is no information in a constant column. Conditioning cannot increase entropy, so H(X1|Y) = 0 as well, and the mutual information with anything is I(X1; Y) = H(X1) - H(X1|Y) = 0 - 0 = 0. That's why I say that the TC is always zero (total correlation is multivariate mutual information).
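The argument can be checked numerically on the matrix from the question. The sketch below computes empirical entropies and the total correlation, TC(X) = sum_i H(X_i) - H(X_1, ..., X_n), directly from observed frequencies. It is not CorEx's own estimator, and the function names are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(outcomes):
    """Empirical Shannon entropy (in nats) of a sequence of hashable outcomes."""
    counts = np.array(list(Counter(outcomes).values()), dtype=float)
    p = counts / counts.sum()
    return float(np.sum(p * np.log(1.0 / p)))

def total_correlation(X):
    """Sum of the column entropies minus the joint entropy of the rows."""
    X = np.asarray(X)
    marginals = sum(entropy(X[:, j].tolist()) for j in range(X.shape[1]))
    joint = entropy([tuple(row) for row in X.tolist()])
    return marginals - joint

X = np.array([[1, 1, 1, 0, 0],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 1, 1]], dtype=int)

h_first = entropy(X[:, 0].tolist())      # constant column: entropy is 0
tc_constant = total_correlation(X[:, :3])  # three columns of ones: TC is 0
tc_varying = total_correlation(X[:, 3:])   # two identical binary columns: TC is ln 2
print(h_first, tc_constant, tc_varying)
```

The last value is ln 2, roughly 0.693 nats, which is close to the 6.92147593e-01 reported in the question. That suggests, though the thread does not state it, that the reported TCs are in nats.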

Because constant columns have no information, it's hard for me to see how they would contribute to a visualization. My advice would be to treat them as a special case and display them separately.
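One way to treat them as a special case, sketched under assumptions: fit on the non-constant columns only, then write the cluster labels back into an array covering all original columns, with a sentinel value marking the constant ones. The helper expand_labels, the sentinel -1, and the example label array are all hypothetical. The labels here are a stand-in for whatever a fit on the reduced data returns, not actual CorEx output:

```python
import numpy as np

def expand_labels(labels_varying, varying_idx, n_columns, constant_label=-1):
    """Place labels for the varying columns into a full-length array.

    Columns that were dropped as constant receive constant_label.
    Hypothetical helper, not part of bio_corex.
    """
    full = np.full(n_columns, constant_label, dtype=int)
    full[np.asarray(varying_idx)] = labels_varying
    return full

# Original data had 5 columns; columns 3 and 4 were the only non-constant ones.
varying_idx = np.array([3, 4])
# Stand-in for cluster labels obtained by fitting on the reduced data.
labels_varying = np.array([0, 0])

full_labels = expand_labels(labels_varying, varying_idx, n_columns=5)
print(full_labels)  # [-1 -1 -1  0  0]
```

A plotting or reporting step can then list the columns labelled -1 in their own group instead of attaching them to a latent factor.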