ucl-pond / pySuStaIn

Subtype and Stage Inference (SuStaIn) algorithm with an example using simulated data.
MIT License

ValueError in AbstractSuStaIn #48

Closed katrinaCode closed 1 year ago

katrinaCode commented 1 year ago

Hi all,

Thanks for your help with my past issue!

I'm now encountering a new error within the AbstractSuStaIn package that seems to relate to the staging portion of the algorithm:

Error traceback:

    MCMC Iteration: 100%|██████████| 10000/10000 [00:24<00:00, 414.37it/s]
    MCMC Iteration: 100%|██████████| 10000/10000 [00:23<00:00, 422.43it/s]
    MCMC Iteration: 100%|██████████| 10000/10000 [00:30<00:00, 325.00it/s]
    MCMC Iteration: 100%|██████████| 1000/1000 [00:02<00:00, 462.84it/s]

    [pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py:556: RuntimeWarning: invalid value encountered in divide
      total_prob_subtype_norm = total_prob_subtype / np.tile(np.sum(total_prob_subtype, 1).reshape(len(total_prob_subtype), 1), (1, N_S))

    [pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py:557: RuntimeWarning: invalid value encountered in divide
      total_prob_stage_norm = total_prob_stage / np.tile(np.sum(total_prob_stage, 1).reshape(len(total_prob_stage), 1), (1, nStages + 1)) #removed total_prob_subtype

    [pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py:560: RuntimeWarning: invalid value encountered in divide
      total_prob_subtype_stage_norm = total_prob_subtype_stage / np.tile(np.sum(np.sum(total_prob_subtype_stage, 1, keepdims=True), 2).reshape(nSamples, 1, 1),(1, nStages + 1, N_S))

    Traceback (most recent call last):
      File "[notebook].py", line 475, in
        prob_subtype_stage = sustain_input.run_sustain_algorithm()
      File "[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py", line 186, in run_sustain_algorithm
        prob_subtype_stage = self.subtype_and_stage_individuals(self.__sustainData, samples_sequence, samples_f, N_samples) #self.subtype_and_stage_individuals(self.__data, samples_sequence, samples_f, N_samples)
      File "[pysustain package location]/lib/python3.10/site-packages/pySuStaIn/AbstractSustain.py", line 590, in subtype_and_stage_individuals
        this_prob_stage = np.squeeze(prob_subtype_stage[i, :, int(ml_subtype[i])])
    ValueError: cannot convert float NaN to integer

This error is happening both locally and on a remote computing cluster. I've already added in an assert that none of the data going into SuStaIn contains NaNs, and ensured that my Z_vals are all integers (I am using Zscore SuStaIn). Do you have any ideas of what may be causing this issue or how to solve it?
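For readers following along, the three warnings and the final `ValueError` are consistent with a row of probabilities summing to exactly zero, so that the normalisation computes 0/0. The sketch below reproduces that pattern with the same `np.tile` expression as in the traceback. The array values are made up for illustration, and the claim that a NaN row leaves `ml_subtype[i]` unset is an assumption drawn from the discussion, not a confirmed diagnosis.

```python
import numpy as np

# Hypothetical stand-in for per-subject subtype likelihoods:
# 3 subjects, 1 subtype. The second subject's likelihood is exactly 0.
total_prob_subtype = np.array([[0.2], [0.0], [0.7]])
N_S = total_prob_subtype.shape[1]

# Same normalisation pattern as in the traceback (line 556).
row_sums = np.sum(total_prob_subtype, 1).reshape(len(total_prob_subtype), 1)
with np.errstate(invalid="ignore"):  # silence the RuntimeWarning for this demo
    total_prob_subtype_norm = total_prob_subtype / np.tile(row_sums, (1, N_S))

# Rows where 0/0 produced NaN.
nan_rows = np.isnan(total_prob_subtype_norm).any(axis=1)


def to_int_or_error(value):
    """Return int(value), or the error message if the conversion fails."""
    try:
        return int(value)
    except ValueError as err:
        return str(err)


# Converting NaN to an integer fails with the message seen in the traceback.
conversion_result = to_int_or_error(np.nan)
print(nan_rows, conversion_result)
```

If this is what is happening, the input data can be free of NaNs and the error can still occur, because the NaN is created during normalisation rather than read from the data.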

As a related question, my research group and I are wondering why negative z-scores are not allowed in SuStaIn, and what the best practices are for handling them. We are currently shifting the z-score distribution to the right to ensure all values are > 0, but this means that we are losing the interpretability of z = 0, etc. Do you have any advice?

Thank you!

noxtoby commented 1 year ago

Advice: only look at positive z-scores and don't shift your data as you've described. Negative Z scores are not of interest to disease progression because they are on the "normal" end of biomarker measurements.

That error is a new one for me. I'll take a guess. It looks like some data cannot be assigned to a ml_subtype (hence NaN). I suspect that it's related to the weird things you've done with your data (shifting z-scores to the right). By including lots of not-abnormal measurements centred around the normal average (z=0), you might be trying to force an unidentifiable multiple-cluster solution. Perhaps a cluster that is essentially noise could result in NaN values for ml_subtype. For sure your resulting subtypes and stages would make little if any sense.

katrinaCode commented 1 year ago

Hi Neil,

Thank you — we have a significant proportion of our data that have negative z-scores (between 20% and 80% depending on biomarker). My understanding of z-scores from a purely mathematical sense is that only z-scores of 0 are "normal", and any other score indicates abnormality. Could you elaborate more to help me understand how negative z-scores are considered "normal" for this application, and do you have any suggestions to minimize data loss since we have so much negative data?

Thank you for your suggestion. I've tried running the notebook again without the z-score shift, setting all negative z-scores to 0 instead. I had done this in a previous notebook that ran without issue. Unfortunately, the error persists.

Thanks again!

noxtoby commented 1 year ago

It's explained in the methods section of the original SuStaIn paper, and in the tutorial notebooks in this repo.

In brief:

katrinaCode commented 1 year ago

Hi Neil, thank you!

Yes, I've read the paper(s) and the tutorials extensively but still need the clarification so thank you for your comment. I understand that z>0 is defined as the abnormal direction, however we still mathematically end up with some z-scores that are negative (which are strictly necessary, e.g. if the control mean and standard deviation are 0 and 1 respectively, as in "Preparing Data for SuStaIn") so I am trying to understand how to best handle those. For context, we are following a previous paper that included their cognitively normal/control population in the Sustain input, which is why we are not excluding patients within the control distribution.

If the distribution shouldn't be shifted nor the negative values zeroed out or removed, I am not sure what other options there are. My understanding from the literature was that SuStaIn could not accept negative inputs, but does your last point imply that data can be negative, as long as Z_vals are strictly positive? I had been looking into this independently and saw in Zscore Sustain line 69 that it specifically states that the z-scores need to be positive. Apologies if I am misinterpreting and thank you for taking the time to explain.

The same ValueError still occurs after removing the z-score shift as suggested, and when running with only positive values (negative scores zeroed, which led to complete runs in previous versions of the notebook, though I agree with your point about artificially modifying). I also ran without zeroing out negative z-scores as per my interpretation of your comment (so there were negative scores within data), with the same error. The issue is not resolved. Do you have any other suggestions?

Thank you for your help!

noxtoby commented 1 year ago

Correct — it's the direction of abnormality that needs to be positive, not the values. Of course the data (both from patients and controls) will contain negative values after z-scoring. Apologies on behalf of the developers, but — unless I'm mistaken — the comment in line 69 of ZscoreSustain.py is poorly worded and should mean that the direction of abnormality needs to be positive.
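A minimal sketch of the point above, assuming z-scores are computed against the control group and that biomarkers which decrease with disease are sign-flipped so that larger z always means more abnormal. The function name `zscore_vs_controls`, the flag `higher_is_abnormal`, and the toy values are hypothetical and not part of pySuStaIn.

```python
import numpy as np


def zscore_vs_controls(values, is_control, higher_is_abnormal=True):
    """Z-score values against the control group's mean and standard deviation.

    If the biomarker decreases with disease, flip the sign so that
    positive z is always the direction of abnormality.
    """
    values = np.asarray(values, dtype=float)
    is_control = np.asarray(is_control, dtype=bool)
    mu = values[is_control].mean()
    sigma = values[is_control].std(ddof=1)
    z = (values - mu) / sigma
    return z if higher_is_abnormal else -z


# Toy example: a volume-like biomarker that shrinks with disease.
volumes = np.array([10.0, 11.0, 9.0, 10.0, 7.0, 6.0])
is_control = np.array([True, True, True, True, False, False])
z = zscore_vs_controls(volumes, is_control, higher_is_abnormal=False)
print(z)
```

In this toy example both patients end up with positive z-scores, while one control ends up with a negative one. That is expected: negative values remain in the data, and only the event thresholds in `Z_vals` are chosen to be positive.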

I'm almost certain that the hyperparameter event thresholds in Z_Vals all need to be positive. Even if not (but I've never tested the code with negative z-score events), negative event thresholds make no sense as this would be in the opposite direction to disease progression. Why would anyone care if a disease subtype has an early event that amounts to biomarker abnormality that is more normal than the average non-diseased control?

In that spirit, it feels pertinent to state that clustering is not a magical weapon. The user needs to carefully consider the input data and the model. For example, sensible feature selection would exclude features that amount to noise for a given hyperparameter configuration. And hyperparameter configuration needs to respect the available disease signal in your input features. I have a paper in preparation about this.

> For context, we are following a previous paper that included their cognitively normal/control population in the Sustain input, which is why we are not excluding patients within the control distribution.

I presume "patients within the controls distribution" means patients with normal-looking measurements. Of course some patients will have measurements well within normal limits, e.g., z<1. Such a biomarker might end up at the end of a data-driven sequence, but it certainly can be included in the model if desired.

And it's fine for the user to include whichever samples they like when training a data-driven model (including measurements from controls), but the model will have to be interpreted with that in mind. If controls will never develop the disease (as is usually, but not always the case), then there's a strong argument for excluding them from the training set.

In my opinion pySuStaIn is not the primary source of your issue. A little forethought is needed in terms of data science / machine learning good practice, e.g., feature selection and hyperparameter tuning as mentioned above. If you don't have a data scientist or statistician on your team, I suggest finding one or two.

katrinaCode commented 1 year ago

Hi Neil,

Thank you for taking the time to respond, and for clarifying the positive z-scores comment from the code. We had been struggling to find a reason why the data should not contain negative scores, and confirming that it can is very helpful. I apologize if there was ambiguity in my previous comment: I have never used negative Z_vals, and I understand their interpretation and why they must be positive.

I appreciate your comments and feedback and will be discussing the issues you raised with my group. We would appreciate if a second look could be taken at the error, or the issue could be re-opened, as we have ensured the input follows the required constraints as outlined in the literature (and as clarified by you, thank you), and we are still unable to run to completion of the 0 subtype/1 cluster problem due to this ValueError (despite having previous success). It routinely fails with the ValueError on the fourth set of MCMC iterations, no matter how many start points and MCMC iterations are used (tested 10 SPs & 1e3 MCMC iters; 15 SPs & 1e4 MCMCs; and 25 SPs and 1e6 MCMCs).

During this testing, I noticed a strange behaviour: despite using a new console/clearing all variables/restarting Spyder between each test, instead of having the same number of MCMC iterations in each set as expected, when setting `N_iterations_MCMC = int(1e6)` the first three sets of MCMC iterations had 1e4 iterations and the fourth had 1e6, i.e.:

    MCMC Iteration: 0%| | 0/10000 [00:00<?, ?it/s]
    MCMC Iteration: 0%| | 0/10000 [00:00<?, ?it/s]
    MCMC Iteration: 0%| | 0/10000 [00:00<?, ?it/s]
    MCMC Iteration: 0%| | 0/1000000 [00:00<?, ?it/s]
    [... ValueError: cannot convert float NaN to integer]

(The Spyder console does not show the tqdm updates to the MCMC iters, so it only shows 0%, that is not the issue.)

So, there are two issues/behaviours occurring at the fourth set. I'd be happy to open a new issue for this MCMC iterations behaviour if that would be helpful. The model is not loading in from any previous solutions.

Thank you!

noxtoby commented 1 year ago

Give a debugging tool a go. If that doesn't help you to isolate the source of the NaN, then I'm happy to reopen the issue.
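One way to approach this, sketched under the assumption that the NaN first appears in a NumPy division, is to turn the `RuntimeWarning` into an exception with `np.errstate`, so that the traceback (or a debugger) stops at the first invalid operation. `normalise_rows` and the array `p` below are hypothetical stand-ins, not pySuStaIn code.

```python
import numpy as np


def normalise_rows(p):
    """Divide each row by its sum (same idea as the np.tile expression)."""
    return p / p.sum(axis=1, keepdims=True)


# Hypothetical probabilities: the second row sums to zero.
p = np.array([[0.5, 0.5], [0.0, 0.0]])

try:
    # Raise FloatingPointError instead of emitting a RuntimeWarning.
    with np.errstate(invalid="raise", divide="raise"):
        normalise_rows(p)
    caught = None
except FloatingPointError as err:
    caught = str(err)

# Indices of the rows (subjects) responsible for the 0/0.
zero_rows = np.flatnonzero(p.sum(axis=1) == 0)
print(caught, zero_rows)
```

In practice the `with np.errstate(...)` block would wrap the call to `run_sustain_algorithm()`. Be aware that it may also stop on earlier floating-point warnings that are harmless, in which case `invalid="raise"` alone is the narrower setting. Inspecting which subjects have a zero row sum is then a natural next step.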

But we probably can't help much on this end (having not seen this error before), unless you provide a minimal working example that reproduces the error. In this case, a MWE probably requires the data, or synthetic data that closely resembles the real data (and reproduces the error, of course).

Tonnar commented 8 months ago

@katrinaCode Hello - I am currently having a similar issue. Did you ever find a workaround?

katrinaCode commented 8 months ago

Hi @Tonnar, yes! @KangMSPeter from my lab created this fix that has worked for both of us:

  1. add 1e-250 in the denominator of the total_prob_subtype_norm calculation to prevent dividing by 0 on line 556 in the subtype_and_stage_individuals function in AbstractSuStaIn, like so:

    total_prob_subtype_norm         = total_prob_subtype        / ((np.tile(np.sum(total_prob_subtype, 1).reshape(len(total_prob_subtype), 1),        (1, N_S))) + 1e-250)


  2. Then, replace the try/if-else statements from lines 578 to 588 in the subtype_and_stage_individuals function in AbstractSuStaIn with this:

    try:
        ml_subtype[i]           = this_subtype
    except:
        ml_subtype[i]           = this_subtype[0][0]
    if this_prob_subtype.size == 1:
        if this_prob_subtype == 1:
            prob_ml_subtype[i]  = 1
        else:
            prob_ml_subtype[i]  = this_prob_subtype
    else:
        try:
            prob_ml_subtype[i]  = this_prob_subtype[this_subtype]
        except:
            prob_ml_subtype[i]  = this_prob_subtype[this_subtype[0][0]]
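
To illustrate why step 1 removes the NaN, here is a small sketch comparing the `1e-250` offset with an explicit guard using `np.divide(..., where=...)`. The helper `normalise_rows_safe` and the toy array are hypothetical and not part of pySuStaIn. Note that the warnings in the original traceback also point at the stage normalisations on lines 557 and 560, which appear to follow the same pattern.

```python
import numpy as np


def normalise_rows_safe(p):
    """Normalise each row to sum to 1; rows that sum to 0 are left as zeros."""
    p = np.asarray(p, dtype=float)
    denom = p.sum(axis=1, keepdims=True)
    out = np.zeros_like(p)
    # Only divide where the row sum is positive; other entries keep the 0 in `out`.
    np.divide(p, denom, out=out, where=denom > 0)
    return out


# Toy probabilities: the second row sums to exactly zero.
p = np.array([[0.2, 0.6], [0.0, 0.0]])

# The workaround from step 1: a tiny offset in the denominator.
with_eps = p / (p.sum(axis=1, keepdims=True) + 1e-250)

# The explicit guard gives the same result without relying on a magic constant.
guarded = normalise_rows_safe(p)
print(with_eps, guarded)
```

Either way, a zero-sum row becomes a row of zeros rather than NaNs, so every subtype ties for the maximum and the resulting assignment for that subject carries no information. It is probably worth recording which subjects this affects and checking why their likelihoods are zero.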

I hope this helps!

Tonnar commented 8 months ago

@katrinaCode Thank you so much for your response! This looks like it will help a ton!