gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations
Apache License 2.0
137 stars · 30 forks

understanding Corex and interpreting the results #8

Open pocin opened 5 years ago

pocin commented 5 years ago

I am trying to understand the algorithm and how to interpret the results. Can you please help me with that?

Imagine the following scenario: Harry and Sally each record a series of coin throws, but Harry's second throw and Sally's first throw are actually the same throw.

This is captured in a table like this:

harry_1  harry_2  sally_1  sally_2  sally_3
0        1        1        0        0
0        0        0        1        0
1        1        1        1        1
1        0        0        1        1
...

i.e. variables harry_2 and sally_1 are identical by construction (and hence strongly dependent), so there should be a clear relationship between them. All other pairs should be independent.

But when I run

import numpy as np
import pandas as pd
from scipy.stats import bernoulli
from corex import Corex  # from the bio_corex repo

def generate_data():
    # Four independent fair coins; harry_2 and sally_1 share a column,
    # so those two variables are identical by construction.
    coin = bernoulli(p=0.5)
    throws = coin.rvs((5000, 4))
    harry = throws[:, :2]   # columns 0, 1 -> harry_1, harry_2
    sally = throws[:, 1:]   # columns 1, 2, 3 -> sally_1, sally_2, sally_3
    return np.concatenate((harry, sally), axis=1)

data = pd.DataFrame(
    generate_data(),
    columns=['harry_1', 'harry_2', 'sally_1', 'sally_2', 'sally_3'])

cd = Corex(
    n_hidden=2,
    dim_hidden=2,
    marginal_description='discrete',
    verbose=False)

cd.fit(data)

cd.tcs
# array([ 0.694, -0.   ])

cd.clusters
gives different, seemingly random results each time, for example array([0, 0, 0, 1, 1]),

but I would expect [0, 1, 1, 0, 0] or [1, 0, 0, 1, 1] every time, indicating that variables 1 and 2 (harry_2 and sally_1) share a common latent variable.

Or did I completely miss the point here? Is the example badly chosen, so that it can't be modeled using CorEx at all?

pocin commented 5 years ago

Aha! When I run CorEx with n_hidden=5, dim_hidden=2 ten times, the total correlations are the same on every run ([0.693, 0., -0., -0., -0.]), but I get these clusters:

[3 0 0 2 1]
[2 0 0 1 1]
[1 0 0 2 1]
[1 0 0 2 1]
[2 0 0 2 3]
[1 0 0 2 3]
[2 0 0 1 3]
[3 0 0 2 1]
[3 0 0 1 2]
[1 0 0 1 1]

This seems more realistic: columns 1 and 2 (harry_2 and sally_1) end up in the same cluster in every run, while the remaining variables seem to be assigned randomly, which corresponds to the near-zero correlations between them. Is that correct?
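
For reference, a minimal sketch of that repeated-run experiment (it reuses the generate_data() helper from the first comment and assumes bio_corex is importable as corex; the rounding is only for readability):

import numpy as np
from corex import Corex  # from the bio_corex repo

X = generate_data()   # the 5000 x 5 binary array defined above

for _ in range(10):
    cd5 = Corex(n_hidden=5, dim_hidden=2,
                marginal_description='discrete', verbose=False)
    cd5.fit(X)
    print(np.round(cd5.tcs, 3), cd5.clusters)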

gregversteeg commented 5 years ago

This makes sense. I would expect the top latent factor to have one bit of TC (units are in nats, and ln 2 ≈ 0.693 nats is 1 bit) and to capture the bit of mutual information between harry_2 and sally_1. I'm surprised it didn't work with just two latent factors, but I can think of one issue. Because there is no other dependence to capture, the other latent factors will be random. Let Z1 be the "correct" factor, which equals harry_2 and sally_1, and let Z2 be some random factor. Then MI(harry_1; Z2) = 0, but also MI(harry_1; Z1) = 0, so it's hard to tell which cluster to put harry_1 into. If you look at the mutual information matrix, it should be more obvious that harry_2 and sally_1 belong together while everything else is pretty much random. Depending on your application, this could be solved with a different definition of clusters, where some threshold on MI has to be achieved, for instance.
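
A possible sketch of that thresholding idea (not part of the CorEx API; the cutoff value and the -1 label for "unclustered" are illustrative choices, and it assumes the fitted model exposes the latent-factor/variable mutual informations as cd.mis, as discussed further down the thread):

import numpy as np

def thresholded_clusters(cd, min_mi=0.1):
    mis = np.asarray(cd.mis)           # rows = latent factors, columns = variables
    best = np.argmax(mis, axis=0)      # factor with the highest MI for each variable
    strength = np.max(mis, axis=0)     # the value of that highest MI
    best[strength < min_mi] = -1       # below the cutoff: leave the variable unclustered
    return best

# For the toy data this might give something like [-1, 0, 0, -1, -1]:
# only harry_2 and sally_1 are confidently assigned to a factor.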

gregversteeg commented 5 years ago

What happens when you set n_hidden=5 is that each latent factor can pick one of the independent clusters (all the other clusters have just a single column). This makes the clustering easier.

pocin commented 5 years ago

Thanks Greg, that makes sense.

If you look at the mutual information matrix, it should be more obvious that harry_2 and sally_1 belong together while everything else is pretty much random

Is this the mutual information matrix?

[screenshot of a matrix from the fitted model]

gregversteeg commented 5 years ago

Alpha is not the mutual information matrix. It's in corex.mis.

pocin commented 5 years ago

Beautiful, just as you anticipated!

If I can trouble you with a few more questions: corex.mis is not symmetric, so do we care only about the upper triangle?

I am still going through the original paper; there is a lot for me to learn in order to fully understand it. Also, I'd like to say that I appreciate that you made the code available and are interacting with me here.

gregversteeg commented 5 years ago

It represents the mutual information MI(Zj; Xi), where the j-th row is for the j-th latent factor and the i-th column is for the i-th variable. If you have a different number of variables than n_hidden, it will be more obvious that the array is rectangular. (Better double-check, though, I could have mixed up j and i.)
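
A quick way to check this, assuming cd is the fitted model from the first comment and that cd.mis is a NumPy array (the column names below are just the toy variable names from this thread):

import numpy as np
import pandas as pd

mis = np.asarray(cd.mis)
mi = pd.DataFrame(
    mis,
    index=['factor_%d' % j for j in range(mis.shape[0])],
    columns=['harry_1', 'harry_2', 'sally_1', 'sally_2', 'sally_3'])
print(mi.shape)      # (n_hidden, n_variables), e.g. (2, 5) for the first fit
print(mi.round(3))

Since the array is n_hidden by n_variables rather than a square variable-by-variable matrix, there is no upper triangle to restrict attention to; each entry is the MI between one latent factor and one observed variable.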