ucl-pond / pySuStaIn

Subtype and Stage Inference (SuStaIn) algorithm with an example using simulated data.
MIT License

Fix for "rare" divide by zero problem #28

Open illdopejake opened 2 years ago

illdopejake commented 2 years ago

Dear SuStaIn friends,

I have run into the same error on a number of occasions when using SuStaIn on different datasets. The error itself looks something like this:

Traceback (most recent call last):
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/multiprocess/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pySuStaIn/AbstractSustain.py", line 709, in _find_ml_iteration
    _                               = self._perform_em(sustainData, seq_init, f_init, rng)
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pySuStaIn/AbstractSustain.py", line 864, in _perform_em
    candidate_likelihood            = self._optimise_parameters(sustainData, current_sequence, current_f, rng)
  File "/Users/jacobv/SuStaIn_workshop/lib/python3.7/site-packages/pySuStaIn/ZscoreSustain.py", line 324, in _optimise_parameters
    this_S                      = this_S[0, :]
IndexError: index 0 is out of bounds for axis 0 with size 0.

As it turns out, this is caused by a divide-by-zero during the normalization of p_perm_k: the division produces NaNs, which propagate forward and don't surface as an error until line 324, as shown above. This is not necessarily caused by any outlying "bad values" (e.g. NaNs) in the original dataset, so it's quite hard (impossible?) to detect before running SuStaIn and hitting the error.

This is apparently a known issue, as the following comment exists on line 333 of ZscoreSustain.py: #adding 1e-250 fixes divide by zero problem that happens rarely

A few lines later at 335, the "corrected" line occurs: p_perm_k_norm = p_perm_k_weighted / np.sum(p_perm_k_weighted + 1e-250, axis=(1, 2), keepdims=True)

However, at least in my case, the offending divide-by-zero happens earlier. Note that the corrected line is at 335, whereas my traceback fails before ever reaching it, at line 324, so that fix never comes into play. Instead, the divide-by-zero occurs for me at line 238, which is incidentally the same calculation: p_perm_k_norm = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)

By once again adding the "corrected" line, the problem is surmounted and I no longer get the error. I'm not sure how rare this issue really is, because this is maybe the third time I've encountered it (on different datasets). Requesting a patch to fix it, pretty please!
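For anyone hitting the same traceback, here is a toy sketch (not the actual pySuStaIn arrays, just illustrative shapes) of how an all-zero row turns this normalization into NaNs, and how the 1e-250 term avoids it:

import numpy as np

# Toy stand-in for p_perm_k_weighted: one subject whose weighted likelihoods
# have all underflowed to exactly zero.
p_perm_k_weighted = np.zeros((1, 4, 2))   # illustrative shape only

denom = np.sum(p_perm_k_weighted, axis=(1, 2), keepdims=True)   # 0.0
p_perm_k_norm = p_perm_k_weighted / denom                       # 0/0 -> NaN (RuntimeWarning)
print(np.isnan(p_perm_k_norm).all())                            # True: these NaNs propagate downstream

# The "corrected" version keeps the denominator strictly positive:
p_perm_k_norm = p_perm_k_weighted / np.sum(p_perm_k_weighted + 1e-250,
                                           axis=(1, 2), keepdims=True)
print(np.isnan(p_perm_k_norm).any())                            # False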

Thanks as always for this incredible software!! <3 <3 <3

noxtoby commented 2 years ago

Cheers @illdopejake. I've come across a similar error. @LeonAksman and I added the hacky fix, I believe.

Yours could be fixed (well, hacked) by adding the infinitesimal amount to p_perm_k_weighted on line 257 and/or line 331 of ZscoreSustain.py.

This is somewhat unsatisfying, since the error seems to be deeper than this, given that p_perm_k_weighted is calculated by ZscoreSustain's _calculate_likelihood_stage().

I don't currently have bandwidth to dig into _calculate_likelihood_stage(), but would happily try to help someone if they have a go first.

sea-shunned commented 2 years ago

Hacky fix extended to other lines in https://github.com/ucl-pond/pySuStaIn/commit/cec24a2d45835d495b43c2b98f36141ff963f93a. Tests still pass, just FYI.

I agree the error may be deeper than a numerical stability issue... Potentially a good candidate for something to look at in a pySuStaIn hackathon.

illdopejake commented 2 years ago

Dear SuStaIn heroes:

Having weird bugs again, maybe related to the above? This time I am using an old dataset that the original SuStaIn worked fine with, but with the newest version of pySuStaIn. I had to change several lines to get it to work, and I really had no idea what I was doing, but I wanted to bring it to your attention.

First, I am getting a bunch more divide by zero warnings, specifically at lines 555, 556 and 559 of AbstractSustain.py. In particular, line 556 seemed to be "fatal" during debugging, so I used the same strategy as above and changed it to:

total_prob_subtype_norm = total_prob_subtype / ((np.tile(np.sum(total_prob_subtype, 1).reshape(len(total_prob_subtype), 1), (1, N_S))) + 1e-250)
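(As an aside, while chasing these down it can help to make numpy raise on the first bad division instead of just warning; this is plain numpy, nothing pySuStaIn-specific, and run_my_sustain() below is just a hypothetical stand-in for however you launch the fit.)

import numpy as np

# Promote the divide-by-zero / invalid-value RuntimeWarnings to exceptions so the
# traceback points at the first offending normalization rather than a later IndexError.
# Note: this only affects the current process, so parallel startpoints run in
# pathos/multiprocess workers won't inherit it.
with np.errstate(divide="raise", invalid="raise"):
    run_my_sustain()   # hypothetical placeholder for your SuStaIn run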

Next, I was getting a dimension error around lines 577 and lines 584. The structure of the code seems to anticipate this:

try:
    ml_subtype[i]           = this_subtype
except:
    ml_subtype[i]           = this_subtype[0][0]

The problem is that the exception was not getting caught. To get around this, I changed the try/except into an if/else statement:

if type(this_subtype) == tuple:
    ml_subtype[i]           = this_subtype[0][0]
else:
    ml_subtype[i]           = this_subtype

I then had to do something similar a few lines below, at line 584:

if type(this_subtype) == tuple:
    prob_ml_subtype[i]  = this_prob_subtype
else:
    prob_ml_subtype[i]  = this_prob_subtype[this_subtype]
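A possibly cleaner way around the tuple-versus-scalar ambiguity (untested against the current AbstractSustain, so treat it as a sketch) would be to flatten whatever comes back before assigning:

import numpy as np

# Sketch only: np.where() returns a tuple of index arrays, which is likely why
# the original try/except didn't behave as expected; coercing to a plain int
# up front avoids special-casing the assignments below.
this_subtype = int(np.ravel(np.asarray(this_subtype))[0])
ml_subtype[i] = this_subtype

The prob_ml_subtype assignment could then index with that integer directly, though I haven't verified what shape this_prob_subtype has at that point.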

Altogether, these completely uninformed and pathetic hacks seemed to "solve" the problem -- SuStaIn ran without issue after that. But they are perhaps suggestive of some other little gotchas lingering in the code, and I have no idea why these issues occurred in the first place.

I'm wondering if the reason I keep running into these issues is that I'm using data with really large z-scores. Because PET is rather sensitive, it's not uncommon to get z-scores above 30, for example, when normalizing to controls. Could that perhaps explain what's going on? Do y'all ever run tests/simulations with big z-scores like that?

Thanks friends!

--Jake

noxtoby commented 2 years ago

Hey @illdopejake — thanks as ever for your detailed feedback.

I reckon you've hit the nail on the head with the high z-scores, as I've had similar weirdness myself. Usually I find that the Z_max parameter of ZscoreSustain fixes it: it must be larger than the largest z-scores actually present in the data.
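If it helps, a quick sanity check along those lines (illustrative only; zdata here is a toy stand-in for your subjects-by-biomarkers matrix of z-scores relative to controls):

import numpy as np

# zdata: stand-in for your (n_subjects, n_biomarkers) matrix of z-scores vs. controls.
rng = np.random.default_rng(0)
zdata = np.abs(rng.normal(0, 10, size=(200, 4)))   # toy data with some large z-scores

observed_max = np.nanmax(zdata, axis=0)
print("max z-score per biomarker:", observed_max)

# Whatever Z_max you pass to ZscoreSustain should be at least this large,
# e.g. rounded up to the next integer per biomarker.
Z_max = np.ceil(observed_max)
print("suggested Z_max:", Z_max)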

I believe that there aren't any tests/simulations on this.

Being more of a MixtureSuStaIn guy myself, I'd hope that one of the Zscore gurus here could have a look (@ayoung11 obviously at the top of that list).

ayoung11 commented 2 years ago

I'm on the case! My guess would be that it's a precision problem that causes the likelihood to go to zero for z-scores that are very far from those included in Z_vals and Z_max, but I will let you know when I get to the bottom of it.

katrinaCode commented 3 months ago

Hi all,

I've found the source of this error and a fix -- np.exp() underflows in float64 and returns 0 if its argument is less than roughly -700. This means that p_perm_k = np.exp(p_perm_k) at line 239 in _calculate_likelihood_stage will be all 0s if every entry of p_perm_k is below about -700, and this then propagates into NaNs starting with the f_opt calculation at line 262 in _optimise_parameters.

The fix is to ensure p_perm_k is a longdouble in _calculate_likelihood_stage. This extends the representable range so that np.exp() only underflows to zero for arguments below roughly -11300 (on platforms where longdouble is an 80-bit extended-precision type).
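The underflow is easy to see in isolation (plain numpy; note that the exact behaviour of longdouble is platform-dependent, e.g. on Windows it is often just an alias for float64):

import numpy as np

print(np.exp(-750.0))                 # 0.0: underflows in float64
print(np.exp(np.longdouble(-750.0)))  # ~2e-326 on x86, where longdouble has an extended exponent range
print(np.finfo(np.longdouble))        # shows the actual range on your platform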

I can't comment on why p_perm_k becomes so negative or what that indicates; maybe the devs will have some idea? I usually ran into this error in later stages of the solve (i.e. on cluster 4 of 6, and rarely on the first one or two splits).

I will post my entire working _calculate_likelihood_stage function below, starting at line 209 (everything above line 212 is unchanged). It contains extra checks/failsafes/print statements and some debugging notes, and could be simplified.

The fix consists of ensuring p_perm_k, factor, coeff, and x are all initialized/defined as longdoubles, and double-checking that p_perm_k is a longdouble before calling np.exp(p_perm_k). There are then some checks for p_perm_k being all zeros after np.exp(p_perm_k) (which it shouldn't be once the fix is in, unless it has underflowed even as a longdouble), since that is what causes the NaNs later on and ultimately the IndexError.

def _calculate_likelihood_stage(self, sustainData, S):
        ''' ... lines 164 to 209 unchanged ... '''

        stage_value                         = 0.5 * point_value[:, :point_value.shape[1] - 1] + 0.5 * point_value[:, 1:]

        M                                   = sustainData.getNumSamples()   #data_local.shape[0]

        # fix starts here!
        # p_perm_k is initialized as all 0s.
        p_perm_k                            = np.zeros((M, N + 1), dtype="longdouble")
        if p_perm_k.dtype != "longdouble":
            print("p_perm_k dtype after being initialized: ", p_perm_k.dtype)

        # optimised likelihood calc - take log and only call np.exp once after loop
        sigmat = np.array(self.std_biomarker_zscore)

        factor                              = np.log(1. / np.sqrt(np.pi * 2.0) * sigmat).astype("longdouble")
        coeff                               = np.log(1. / float(N + 1)).astype("longdouble")

        # original
        """
        for j in range(N+1):
            x                   = (data-np.tile(stage_value[:,j],(M,1)))/sigmat
            p_perm_k[:,j]       = coeff+np.sum(factor-.5*x*x,1)
        """
        # faster - do the tiling once
        # stage_value_tiled                   = np.tile(stage_value, (M, 1))
        # N_biomarkers                        = stage_value.shape[0]
        # for j in range(N + 1):
        #     stage_value_tiled_j             = stage_value_tiled[:, j].reshape(M, N_biomarkers)
        #     x                               = (sustainData.data - stage_value_tiled_j) / sigmat  #(data_local - stage_value_tiled_j) / sigmat
        #     p_perm_k[:, j]                  = coeff + np.sum(factor - .5 * np.square(x), 1)
        # p_perm_k                            = np.exp(p_perm_k)

        # even faster - do in one go
        x = ((sustainData.data[:, :, None] - stage_value) / sigmat[None, :, None]).astype("longdouble")

        # debug notes:
        # here we are adding a constant (coeff), so p_perm_k should not be all 0s.
        # since log(#) is always < 0 here, coeff is negative, and I have not yet found any
        # instances of np.sum(factor[None, :, None] - 0.5 * np.square(x), 1) being positive.
        # it then calls into question how p_perm_k could be all 0 if its maximum threshold is
        # less than 0.
        # its maximum threshold should be coeff, i.e. log(1/(N+1))

        # turns out, it is all 0 due to float64 underflow in np.exp() below.
        p_perm_k = coeff + np.sum(factor[None, :, None] - 0.5 * np.square(x), 1)

        # double checking the dtype, because I'm paranoid
        if p_perm_k.dtype != "longdouble":
            print("p_perm_k dtype after calc: ", p_perm_k.dtype)

        if (p_perm_k < -700).all():
            print("p_perm_k very large negative number, so exping it will be 0 with float64 dtype.", p_perm_k.dtype, np.mean(np.exp(p_perm_k)))
            if p_perm_k.dtype != "longdouble":
                p_perm_k = p_perm_k.astype("longdouble")
                print("changed p_perm_k's dtype to longdouble in if.", p_perm_k.dtype)

            p_perm_k = np.exp(p_perm_k)
            if np.mean(abs(p_perm_k)) == 0:
                print("p_perm_k all 0 in if statement")

        else:
            p_perm_k = np.exp(p_perm_k)
            if p_perm_k.dtype != "longdouble":
                p_perm_k = p_perm_k.astype("longdouble")
                print("changed p_perm_k's dtype to longdouble in else.", p_perm_k.dtype)

            if np.mean(abs(p_perm_k)) == 0:
                print("p_perm_k all 0 in else statement")

        # this should not be triggered if the above fixes work.
        if np.mean(abs(p_perm_k)) == 0:
            print(sum(sum(p_perm_k)), np.shape(p_perm_k))
            print("\np_perm_k is all 0")

        return p_perm_k


Let me know if this fixes things on your end, as it resolved the issue for me.
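For what it's worth, the more general way to dodge this class of underflow is to normalize in log space and only exponentiate after subtracting the per-subject log-evidence (a log-sum-exp normalization). Here is a sketch of the idea using scipy, offered as a suggestion rather than a drop-in replacement for pySuStaIn's internals:

import numpy as np
from scipy.special import logsumexp

def normalize_log_likelihoods(log_p):
    # log_p: (n_subjects, n_stages) log-likelihoods, however negative.
    # Subtracting the per-subject log-evidence before exponentiating means the
    # largest term maps to exp(~0), so a row can never come out all zeros.
    log_evidence = logsumexp(log_p, axis=1, keepdims=True)
    return np.exp(log_p - log_evidence), log_evidence

# Values far below the float64 exp() cutoff still normalize cleanly:
log_p = np.array([[-900.0, -905.0, -950.0]])
p, _ = normalize_log_likelihoods(log_p)
print(p)          # approx. [0.9933, 0.0067, 1.9e-22]
print(p.sum())    # 1.0 (up to rounding)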

Thanks!