Divide by zero and overflow bug

gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations

Apache License 2.0

139 stars 29 forks source link

X = np.array([[0,0,0,1,1], # A matrix with rows as samples and columns as variables. [0,0,0,1,1], [0,1,1,0,0], [0,1,1,0,0]]) layer1 = Corex(n_hidden=2, dim_hidden=2, marginal_description='gaussian', smooth_marginals=False, n_repeat=10, n_cpu=10 ) layer1.fit(my_data) # Fit on data.

Hi Greg. I took another look at this issue. As mentioned here: https://github.com/gregversteeg/bio_corex/pull/20#issuecomment-770765686 a sig_ml.clip of 0.25 may not be sensible.

Thinking about the problem further, this error occurs when corex assigns a selection of datapoints to a given n_hidden by dim_hidden combination such that each of the datapoints has the identical value - this is required for sig_ml to be calculated =0 in estimate_parameters. This also means that the marginals for those datapoints =0, and therefore the log(marginal) is undefined, since log(0) = -inf.

In the code this happens in the marginal_p function line: return (-(xi - mu)**2 / (2. * sig) - 0.5 * np.log(2 * np.pi * sig)).transpose((1, 0, 2)) # log p(xi|yj) where sig = 0 causes a divide-by-zero in the first term, and a log(0) in the second term.

However, thinking about this for a while, it seems to me that while leads a computational error, a selection of datapoints with the same value is not a gaussian distribution and so the real problem is that the selected data does not fit the model. Furthermore, this is not likely to happen with actually continuous gaussian data, since its unlikely that multiple datapoints would have the same value, however is much more likely to happen when binary data is used with the 'gaussian' marginal_distribution as in the example above, or if for example a categorial variable is coded as 0, 1 alongside truly continuous/gaussian data.

So, perhaps the solution here is not to clip sig at all (since I'm not sure one can define a sig.clip that would be valid for all datasets), but to transfer or exclude problematic variables in the data?

gregversteeg / bio_corex

Divide by zero and overflow bug #19