gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations
Apache License 2.0
137 stars 30 forks source link

Divide by zero and overflow bug #19

Open jpkrooney opened 4 years ago

jpkrooney commented 4 years ago

Hi. I'm seeing divide by zero warning and overflow warning under certain circumstances. It can happen if you have a data column with all zeros and run Corex with a gaussian marginal and smooth marginals turned off. Eg:

X = np.array([[0,0,0,1,1], # A matrix with rows as samples and columns as variables.
              [0,0,0,1,1],
              [0,1,1,0,0],
              [0,1,1,0,0]])

layer1 = Corex(n_hidden=2, dim_hidden=2, marginal_description='gaussian', smooth_marginals=False, n_repeat=10, n_cpu=10 )  
layer1.fit(my_data)  # Fit on data. 

This produces following output (this doesn't happen every run depending on random numbers):

Screenshot 2020-04-25 at 10 04 47

I had a look through the code and did some testing and this error seems to happen in the marginal_p function when sig = 0. Therefore as proposed solution I suggest the clip the minimum value of sig_ml to 0.5 in the estimate parameters function. I don't have a good a-priori reason to pick the value 0.5, however it seems to work well in tests. I have enacted this solution in this fork: https://github.com/jpkrooney/bio_corex

Note also that the issue seems to also allow negative TCS values to occur:

Screenshot 2020-04-25 at 10 18 08

After implementing the 0.5 mimimum value for sig:

Screenshot 2020-04-25 at 10 19 38

The second run with the 0.5 limit in place also converges in less iterations.

I will submit my fork for a PR but I encourage you to do more testing also! Greg, I wonder if this might partly explain the issues you noticed with zero-inflated data ?

jpkrooney commented 3 years ago

Hi Greg. I took another look at this issue. As mentioned here: https://github.com/gregversteeg/bio_corex/pull/20#issuecomment-770765686 a sig_ml.clip of 0.25 may not be sensible.

Thinking about the problem further, this error occurs when corex assigns a selection of datapoints to a given n_hidden by dim_hidden combination such that each of the datapoints has the identical value - this is required for sig_ml to be calculated =0 in estimate_parameters. This also means that the marginals for those datapoints =0, and therefore the log(marginal) is undefined, since log(0) = -inf.

In the code this happens in the marginal_p function line: return (-(xi - mu)**2 / (2. * sig) - 0.5 * np.log(2 * np.pi * sig)).transpose((1, 0, 2)) # log p(xi|yj) where sig = 0 causes a divide-by-zero in the first term, and a log(0) in the second term.

However, thinking about this for a while, it seems to me that while leads a computational error, a selection of datapoints with the same value is not a gaussian distribution and so the real problem is that the selected data does not fit the model. Furthermore, this is not likely to happen with actually continuous gaussian data, since its unlikely that multiple datapoints would have the same value, however is much more likely to happen when binary data is used with the 'gaussian' marginal_distribution as in the example above, or if for example a categorial variable is coded as 0, 1 alongside truly continuous/gaussian data.

So, perhaps the solution here is not to clip sig at all (since I'm not sure one can define a sig.clip that would be valid for all datasets), but to transfer or exclude problematic variables in the data?