gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations
Apache License 2.0

Negative TC results #21

Closed: jahanpd closed this issue 4 years ago

jahanpd commented 4 years ago

Hi. My understanding of the TC is that it is guaranteed to be non-negative. However, I am having some unusual results when combining binary and continuous variables as described in a previous issue.
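For reference, the non-negativity follows from the standard definition of total correlation as a KL divergence between the joint distribution and the product of its marginals. The notation below ($n$ variables $X_1, \dots, X_n$, entropy $H$) is assumed here rather than taken from the CorEx code, and the guarantee holds for the true distribution, not necessarily for quantities estimated under a fitted marginal model:

```latex
\mathrm{TC}(X)
  = \sum_{i=1}^{n} H(X_i) - H(X)
  = D_{\mathrm{KL}}\!\left( p(x) \,\Big\|\, \prod_{i=1}^{n} p(x_i) \right)
  \;\ge\; 0
```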

As an example, when I run the following code:

import numpy as np
import corex as ce

# A matrix with rows as samples and columns as variables.
X = np.array([[0, 0, 0, 0, 4.0],
              [0, 0, 0, 1, 26.0],
              [0, 1, 1, 0, 6.0],
              [1, 0, 1, 1, 30.0]], dtype=int)

layer1 = ce.Corex(n_hidden=2, dim_hidden=2, marginal_description='gaussian', n_repeat=10, verbose=1, seed=1)

layer1.fit(X)  # Fit on data

VERBOSE OUTPUT:

... Overall tc: -32.915963736593795

... Overall tc: -5.596481067514456e-07

... Overall tc: -22.582697642198013

... Best tc: -5.596481067514456e-07

I can only conclude that something is going wrong when the marginal probabilities are modeled. As previously stated: "The way the marginal probabilities are modeled in this case (with mixtures of Gaussians around each binary value) should be equivalent to modeling them as binary."

I also get negative TCs when running certain combinations of binary variables with gaussian marginals turned on.
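As a sanity check independent of CorEx, the total correlation of the binary columns can be computed directly from empirical counts. This is a minimal sketch: the helper names `entropy_from_counts` and `empirical_tc` are hypothetical and not part of bio_corex, and it is a plug-in estimate over discrete outcomes, not the quantity CorEx optimizes.

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (in nats) of the empirical distribution given by counts."""
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def empirical_tc(X):
    """Plug-in total correlation: sum of column entropies minus joint entropy."""
    X = np.asarray(X)
    marginal = 0.0
    for j in range(X.shape[1]):
        _, counts = np.unique(X[:, j], return_counts=True)
        marginal += entropy_from_counts(counts)
    # Each distinct row is treated as one joint outcome.
    _, joint_counts = np.unique(X, axis=0, return_counts=True)
    return marginal - entropy_from_counts(joint_counts)

# The four binary columns from the example above.
X_binary = np.array([[0, 0, 0, 0],
                     [0, 0, 0, 1],
                     [0, 1, 1, 0],
                     [1, 0, 1, 1]])
tc = empirical_tc(X_binary)
print(tc)
```

Computed this way, the value cannot be negative, which suggests the negative numbers come from how the marginals are estimated rather than from the data.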

I'm not sure how to interpret the negative TCs in this context. Any help would be appreciated.

Best wishes

gregversteeg commented 4 years ago

You're absolutely right: it shouldn't be negative, and a negative value is a sign that the marginal distribution is being estimated poorly. Please check whether this pull request fixes the issue: https://github.com/gregversteeg/bio_corex/pull/20 (I was planning to merge it but haven't had time to test it.)

jahanpd commented 4 years ago

Hi Greg,

Thanks for your response. Implementing the clipping was very effective!

I shall close the issue.
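The contents of the pull request are not shown in this thread, so the following is only an illustration of one form such clipping might take, not the actual change: flooring the variance of a Gaussian marginal so its log-density stays finite when a column is (nearly) constant. The function `clipped_gaussian_logpdf` and the `min_var` threshold are assumptions made for this sketch.

```python
import numpy as np

def clipped_gaussian_logpdf(x, mean, var, min_var=1e-6):
    """Gaussian log-density with the variance floored at min_var.

    Hypothetical helper: without the floor, a zero sample variance
    produces infinite or NaN log-densities.
    """
    var = np.maximum(var, min_var)
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

# A constant column has zero sample variance.
x = np.zeros(4)
log_p = clipped_gaussian_logpdf(x, x.mean(), x.var())
print(np.isfinite(log_p).all())
```

A floor like this trades a small bias for numerical stability; a suitable threshold would depend on the scale of the data.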