gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations
Apache License 2.0

Fix for masked array bug, and change sig_ml to tiny value #27

Open jpkrooney opened 3 years ago

jpkrooney commented 3 years ago

Hi Greg,

I'm suggesting two edits with this PR:

  1. Intentionally remove the numpy mask from xi in the marginal_p function when the gaussian option is used. Extracting the data explicitly avoids the issue caused by the numpy bug detailed here: https://github.com/numpy/numpy/issues/18744
  2. Change the value of sig_ml to a very small value (e.g. 1e-200). This still avoids the divide-by-zero issue, but allows biocorex to explore the full parameter space as determined by the data. Note that one side effect is that a negative TCS can occasionally result. This happens when the gaussian marginal description is applied to data that is not truly gaussian; for example, including a categorical variable can generate a negative TCS. Thus, a negative TCS indicates that at least some of the variables in the data don't have a gaussian distribution.
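
To illustrate the two ideas in isolation, here is a minimal sketch. It is not the actual bio_corex code: the helper names `unmask` and `gaussian_log_density`, the `floor` parameter, and the choice to treat the floored quantity as a variance are all assumptions made for this example.

```python
import numpy as np


def unmask(xi):
    """Return the raw values of xi as a plain ndarray, dropping any mask.

    Hypothetical helper: masked entries keep whatever value is stored
    underneath the mask, so the caller must handle them separately.
    """
    return np.ma.getdata(xi)


def gaussian_log_density(x, mu, var, floor=1e-200):
    """Gaussian log-density with a tiny floor on the variance.

    Hypothetical helper: the floor only guards against division by zero
    and log(0); it does not otherwise constrain the fitted variance.
    """
    var = np.maximum(var, floor)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)


# A masked array with one masked entry.
xi = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# Work on the plain data so no masked-array arithmetic is involved.
x = unmask(xi)

# A zero variance would normally divide by zero; the floor keeps it finite.
log_p = gaussian_log_density(x, mu=2.0, var=0.0)
```

With a floor this small, the log-density at a point that coincides with the mean becomes very large and positive, which is consistent with the estimates no longer being bounded the way they are with a larger floor.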

It would be great if you could try the code on datasets you know well.