ucl-pond / pySuStaIn

Subtype and Stage Inference (SuStaIn) algorithm with an example using simulated data.
MIT License

Example code for mixture_KDE #47

Closed: alexander-ratzan closed this issue 1 year ago

alexander-ratzan commented 1 year ago

I am applying the mixture_KDE version of SuStaIn to an external dataset containing cross-sectional cognitive test scores for several thousand patients. I have been trying to adapt the simrun.py script, but I'm running into a few conceptual and technical roadblocks. For one, we don't have control group data, which to my understanding is acceptable for the mixture_KDE model. Without controls, is the random assignment approach that simrun.py uses to generate ground_truth_sequences, ground_truth_subtypes, ground_truth_stages_control, and ground_truth_stages_other an appropriate first step? The pySuStaIn white paper says:

Within simrun.py, simulated subjects assigned earliest stages are used as controls and those in latest stages as cases.

I don't quite understand how this carries over to applying SuStaIn to real data.

In my script, after generating the random ground-truth sequences etc., I comment out this line:

    data, data_denoised = generate_data_mixture_sustain(ground_truth_subtypes, ground_truth_stages, ground_truth_sequences, sustainType)

and instead use a numpy array of my own data, which has exactly the same shape as the array that line would generate. However, I get LinAlgError: Singular matrix when running this line:

    mixtures = fit_all_kde_models(true_data, labels)

I think having a clearer example script of a mixture_KDE implementation with real data would be very useful in helping me answer some of my questions. Please let me know if there are any resources that you could share with me that might be helpful, or if you could address some of my issues directly.

I can also share my current working script if it would be of any help. Thanks!

noxtoby commented 1 year ago

Sorry, but you do need controls to fit a KDE mixture model. This then generates the event likelihoods that are input to the clustering part of SuStaIn.
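
As a rough illustration of what "having controls" means for the inputs, here is a minimal sketch that stacks a control array and a case array and builds the matching label vector. The function name stack_with_labels, the variable names, and the 0 = control / 1 = case coding are assumptions made for this example; check the label convention that fit_all_kde_models expects before relying on it.

```python
import numpy as np


def stack_with_labels(hc_data, patient_data):
    """Stack controls on top of cases and return (data, labels).

    Hypothetical helper, not part of pySuStaIn. Labels use the assumed
    coding 0 = control, 1 = case.
    """
    hc_data = np.asarray(hc_data, dtype=float)
    patient_data = np.asarray(patient_data, dtype=float)
    if hc_data.shape[1] != patient_data.shape[1]:
        raise ValueError("controls and cases must have the same features")
    full_data = np.vstack([hc_data, patient_data])
    full_data_labels = np.concatenate([
        np.zeros(len(hc_data), dtype=int),      # controls
        np.ones(len(patient_data), dtype=int),  # cases
    ])
    return full_data, full_data_labels


# Toy input: 4 controls and 3 cases, 3 features each.
hc_data = np.zeros((4, 3))
patient_data = np.ones((3, 3))
full_data, full_data_labels = stack_with_labels(hc_data, patient_data)
print(full_data.shape, full_data_labels)
```

The mixture model is then fitted per feature on the stacked array, using the labels to tell the two groups apart.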

If you want to run SuStaIn in the absence of controls, you’ll need to create a bespoke model for these event likelihoods.

Good luck.

alexander-ratzan commented 1 year ago

Thanks so much for your helpful response; this makes sense. My team has modified our code so that we now have a control sample and a patient sample. The code runs up to the point of creating an instance of the SuStaIn model, and we have been able to create an instance with mixture_GMM. However, we still get the error below for the mixture_KDE version, which is better suited to our data.

LinAlgError                               Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_21144\1720235003.py in <module>
----> 1 main(hc_data, patient_data, sustainType)

~\AppData\Local\Temp\ipykernel_21144\1919658236.py in main(hc_data, patient_data, sustain_Type)
    116             mixtures = fit_all_gmm_models(full_data, full_data_labels)
    117         elif sustainType == "mixture_KDE":
--> 118             mixtures = fit_all_kde_models(full_data, full_data_labels)
    119 
    120         print(mixtures)
...
...
LinAlgError: Singular matrix

I'd be happy to provide the full error if that would be helpful. Please advise on how to progress.

Thanks

noxtoby commented 1 year ago

I don't know how you modified your code to produce a control sample from the same data. Putting that aside...

A singular matrix error often implies that the mixture modelling is assigning everyone (cases and controls) to a single component (pre-event/post-event, a.k.a. normal/abnormal).

This can happen when the case and control histograms overlap too much for one of your features. Such a feature doesn't carry useful "disease signal" and should probably be excluded from your feature set (and shouldn't be called a biomarker).
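
One way to screen for such features before fitting is sketched below, using only NumPy. For each feature it estimates the AUC, i.e. the probability that a randomly chosen case value exceeds a randomly chosen control value; values near 0.5 mean the two groups overlap almost completely. The function names and the 0.1 threshold are assumptions chosen for this example, not part of pySuStaIn, and the data are simulated.

```python
import numpy as np


def feature_auc(controls, cases):
    """Per-feature AUC: P(case value > control value), ties counted as half."""
    controls = np.asarray(controls, dtype=float)
    cases = np.asarray(cases, dtype=float)
    aucs = np.empty(controls.shape[1])
    for j in range(controls.shape[1]):
        c = controls[:, j]
        p = cases[:, j]
        c = c[~np.isnan(c)][:, None]  # drop missing values, column vector
        p = p[~np.isnan(p)][None, :]  # drop missing values, row vector
        aucs[j] = (p > c).mean() + 0.5 * (p == c).mean()
    return aucs


def weak_features(aucs, min_separation=0.1):
    """Indices of features whose AUC lies within min_separation of 0.5."""
    return np.where(np.abs(np.asarray(aucs) - 0.5) < min_separation)[0]


# Simulated example: feature 0 is higher in cases, feature 1 carries no
# signal, feature 2 is lower in cases.
rng = np.random.default_rng(0)
controls = rng.normal(0.0, 1.0, size=(500, 3))
cases = rng.normal(0.0, 1.0, size=(500, 3)) + np.array([2.0, 0.0, -2.0])

aucs = feature_auc(controls, cases)
weak = weak_features(aucs)
print(np.round(aucs, 2))
print("candidate features to drop:", weak)
```

An AUC well below 0.5 is not a weak feature; it indicates that the feature decreases with disease, which is why the screen uses the distance from 0.5 rather than the raw value.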

Another possibility is that you have input an incorrect "disease direction" while also "fixing" controls to not swap labels, i.e., to stay as pre-event/normal.
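
A quick sanity check on the disease direction, sketched with NumPy only: compare the control and case medians for each feature and flag any feature where the observed direction disagrees with the one you assumed. The function names and the +1 (higher in cases) / -1 (lower in cases) coding are assumptions for this example and may not match the direction argument the mixture-model code expects. This matters for cognitive test scores, where the abnormal direction can differ between tests.

```python
import numpy as np


def observed_direction(controls, cases):
    """Return +1 where the case median is higher, -1 where lower, 0 if equal.

    Assumes every feature has at least one non-missing value per group.
    """
    diff = np.nanmedian(cases, axis=0) - np.nanmedian(controls, axis=0)
    return np.sign(diff).astype(int)


def direction_mismatches(controls, cases, assumed):
    """Indices of features whose observed direction differs from `assumed`."""
    observed = observed_direction(controls, cases)
    return np.where(observed != np.asarray(assumed))[0]


# Simulated example: feature 0 increases with disease, feature 1 decreases.
rng = np.random.default_rng(1)
controls = rng.normal(0.0, 1.0, size=(300, 2))
cases = rng.normal(0.0, 1.0, size=(300, 2)) + np.array([1.5, -1.5])

observed = observed_direction(controls, cases)
# Suppose both features were (wrongly) assumed to increase with disease.
mismatched = direction_mismatches(controls, cases, assumed=[1, 1])
print("observed directions:", observed)
print("features with a mismatched direction:", mismatched)
```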