Closed by alexander-ratzan 1 year ago
Sorry, but you do need controls to fit a KDE mixture model. This then generates the event likelihoods that are input to the clustering part of SuStaIn.
If you want to run SuStaIn in the absence of controls, you’ll need to create a bespoke model for these event likelihoods.
Good luck.
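As a rough illustration of what a bespoke event-likelihood model could look like when no controls are available, the sketch below fits an unsupervised two-component Gaussian mixture to each biomarker and evaluates both components for every subject. This is not part of pySuStaIn: the function names, the `higher_is_abnormal` argument and the `L_no`/`L_yes` array names are assumptions made for illustration, and a Gaussian mixture may not suit skewed or bounded test scores.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def fit_two_gaussians(x, n_iter=200, min_std=1e-3):
    """Plain EM for a 1-D two-component Gaussian mixture (illustrative only)."""
    x = np.asarray(x, dtype=float)
    means = np.percentile(x, [25, 75]).astype(float)  # crude initialisation
    stds = np.full(2, max(x.std(), min_std))
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each subject
        dens = np.stack([w * normal_pdf(x, m, s)
                         for w, m, s in zip(weights, means, stds)])
        resp = dens / np.maximum(dens.sum(axis=0), 1e-300)
        # M-step: update means, standard deviations and weights
        nk = resp.sum(axis=1) + 1e-12
        means = (resp * x).sum(axis=1) / nk
        var = (resp * (x - means[:, None]) ** 2).sum(axis=1) / nk
        stds = np.maximum(np.sqrt(var), min_std)
        weights = nk / nk.sum()
    return means, stds, weights


def event_likelihoods(data, higher_is_abnormal):
    """Return (L_no, L_yes), each shaped subjects x biomarkers.

    higher_is_abnormal[j] is the ASSUMED disease direction of biomarker j.
    """
    data = np.asarray(data, dtype=float)
    L_no = np.empty_like(data)
    L_yes = np.empty_like(data)
    for j in range(data.shape[1]):
        means, stds, _ = fit_two_gaussians(data[:, j])
        low, high = np.argsort(means)
        normal, abnormal = (low, high) if higher_is_abnormal[j] else (high, low)
        L_no[:, j] = normal_pdf(data[:, j], means[normal], stds[normal])
        L_yes[:, j] = normal_pdf(data[:, j], means[abnormal], stds[abnormal])
    return L_no, L_yes


# Synthetic demo: one biomarker with a "normal" and an "abnormal" mode.
rng = np.random.default_rng(0)
demo = np.concatenate([rng.normal(0.0, 1.0, 300),
                       rng.normal(4.0, 1.0, 300)]).reshape(-1, 1)
L_no, L_yes = event_likelihoods(demo, higher_is_abnormal=[True])
```

Without controls, nothing anchors which component is "normal" except the direction you assume, so it is worth plotting each fitted feature before passing the likelihoods on to the clustering step.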
Thanks so much for your helpful response, this makes sense. My team has modified our code to now have a control sample and patient sample. We have been able to run the code up until generating an instance of the SuStaIn model and have actually been able to create an instance with mixture_GMM. However, we are still receiving the below error for the mixture_KDE version, which is more suited to our data.
LinAlgError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_21144\1720235003.py in <module>
----> 1 main(hc_data, patient_data, sustainType)
~\AppData\Local\Temp\ipykernel_21144\1919658236.py in main(hc_data, patient_data, sustain_Type)
116 mixtures = fit_all_gmm_models(full_data, full_data_labels)
117 elif sustainType == "mixture_KDE":
--> 118 mixtures = fit_all_kde_models(full_data, full_data_labels)
119
120 print(mixtures)
...
...
LinAlgError: Singular matrix
I'd be happy to provide the full error if that would be helpful. Please advise on how to progress.
Thanks
I don't know how you modified your code to produce a control sample from the same data. Putting that aside...
A singular matrix error often implies that the mixture modelling is assigning everyone (cases and controls) to a single component (pre-event/post-event, a.k.a. normal/abnormal).
This can happen when the case and control histograms overlap too much for a given feature. Such a feature carries no useful "disease signal" and should probably be excluded from your feature set (and shouldn't be called a biomarker).
Another possibility is that you have input an incorrect "disease direction" while also "fixing" controls to not swap labels, i.e., to stay as pre-event/normal.
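Both possibilities can be screened for before fitting. The sketch below is one rough way to do it, assuming labels are coded 0 for controls and 1 for cases (an assumption, check your own coding): it estimates how much the control and case histograms overlap for each feature and reports the direction the data themselves suggest. The 0.8 overlap threshold and the function names are arbitrary choices for illustration, not pySuStaIn defaults.

```python
import numpy as np


def overlap_coefficient(controls, cases, bins=20):
    """Shared area of two normalised histograms: 0 = disjoint, 1 = identical."""
    lo = min(controls.min(), cases.min())
    hi = max(controls.max(), cases.max())
    if hi == lo:  # every value identical, so the groups cannot be separated
        return 1.0
    edges = np.linspace(lo, hi, bins + 1)
    p_ctl = np.histogram(controls, bins=edges)[0] / len(controls)
    p_case = np.histogram(cases, bins=edges)[0] / len(cases)
    return float(np.minimum(p_ctl, p_case).sum())


def screen_features(data, labels, names, max_overlap=0.8):
    """Report overlap, apparent disease direction and a keep/drop flag per feature."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    report = {}
    for j, name in enumerate(names):
        ctl = data[labels == 0, j]   # ASSUMED coding: 0 = control
        case = data[labels == 1, j]  # ASSUMED coding: 1 = case
        ovl = overlap_coefficient(ctl, case)
        direction = "higher" if np.median(case) > np.median(ctl) else "lower"
        report[name] = {"overlap": ovl,
                        "cases_are": direction,
                        "keep": ovl <= max_overlap}
    return report


# Synthetic demo: one feature with disease signal, one without.
rng = np.random.default_rng(0)
n = 2000
controls = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1, n)])
cases = np.column_stack([rng.normal(3, 1, n), rng.normal(0, 1, n)])
data = np.vstack([controls, cases])
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
report = screen_features(data, labels, ["informative", "uninformative"])
for name, row in report.items():
    print(name, row)
```

If the reported direction for a feature disagrees with the disease direction you supplied to the mixture model, that mismatch is worth resolving before looking at anything else.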
I am in the process of applying the mixture_KDE version of SuStaIn to an external dataset that contains cross-sectional cognitive test score data for several thousand patients. I have been trying to modify the simrun.py script, but I'm running into a few conceptual and technical roadblocks. For one, we don't have control group data, which to my understanding is acceptable for the mixture_KDE model. Without controls, I'm wondering whether the random assignment approach from simrun.py for generating ground_truth_sequences, ground_truth_subtypes, ground_truth_stages_control, and ground_truth_stages_other is the appropriate first step. In the pySuStaIn white paper it says,
I don't quite understand how this transfers to applying SuStaIn to real data.
In my script, after generating the random ground truth sequences etc., I comment out this line
data, data_denoised = generate_data_mixture_sustain(ground_truth_subtypes, ground_truth_stages, ground_truth_sequences, sustainType)
and use the numpy array of my own data, which has exactly the same shape as what the above line of code would generate. However, I am receiving a LinAlgError: Singular matrix error when running this line: mixtures = fit_all_kde_models(true_data, labels)
I think having a clearer example script of a mixture_KDE implementation with real data would be very useful in helping me answer some of my questions. Please let me know if there are any resources that you could share with me that might be helpful, or if you could address some of my issues directly.
I can also share my current working script if it would be of any help. Thanks!
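When swapping simulated data for a real array, a few basic input problems are worth ruling out before the call to fit_all_kde_models. The pre-flight check below is only a sketch: `preflight_problems` is a hypothetical helper, not a pySuStaIn or kde_ebm function, and the label coding (0 = control, 1 = case) is an assumption to be checked against your own script.

```python
import numpy as np


def preflight_problems(data, labels, control_label=0, case_label=1):
    """Return a list of human-readable problems found in the inputs."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    if data.ndim != 2:
        return ["data must be 2-D (subjects x biomarkers)"]
    problems = []
    if labels.shape[0] != data.shape[0]:
        problems.append("labels and data have different numbers of subjects")
    if not np.isfinite(data).all():
        problems.append("data contains NaN or infinite values")
    # A column with no spread cannot separate normal from abnormal.
    for j in range(data.shape[1]):
        col = data[:, j]
        finite = col[np.isfinite(col)]
        if finite.size == 0 or finite.max() == finite.min():
            problems.append(f"column {j} is constant or empty")
    # Both groups need to be present under the ASSUMED label coding.
    for name, value in (("control", control_label), ("case", case_label)):
        if not np.any(labels == value):
            problems.append(f"no subjects labelled as {name} ({value})")
    return problems


# Demo: second column is constant and every subject is labelled as a case.
bad_data = np.array([[0.1, 5.0], [0.4, 5.0], [0.9, 5.0], [1.3, 5.0]])
bad_labels = np.array([1, 1, 1, 1])
bad_report = preflight_problems(bad_data, bad_labels)

good_data = np.array([[0.1, 2.0], [0.4, 2.5], [0.9, 4.0], [1.3, 4.2]])
good_labels = np.array([0, 0, 1, 1])
good_report = preflight_problems(good_data, good_labels)
print(bad_report)
print(good_report)
```

An empty list does not guarantee that the fit will succeed; it only rules out the most mechanical causes.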