ucl-pond / pySuStaIn

Subtype and Stage Inference (SuStaIn) algorithm with an example using simulated data.
MIT License

Two questions in ZscoreSustain #54

Closed. xullllllll closed this issue 1 month ago

xullllllll commented 1 month ago

Dear SuStaIn friends, I have a few questions. First, in ZscoreSustain, do the z_vals and z_max values depend on the input data? If so, how should z_vals and z_max be set for a given dataset? Second, should the input data contain only the patients' z-scores, or must it contain both the patients' z-scores and those of the healthy control group? Is there any difference between the two input methods, and which gives better results? I look forward to your answer. Thank you.

noxtoby commented 1 month ago

First: yes. A balance needs to be struck between a data-driven model and the data on which it is to be trained, so these values should be chosen with your data in mind. I have a poster at AAIC next week on this.

The pySuStaIn tutorial notebook mentions this, so work through it yourself.

Second: the model you train will reflect the data you put in. If you want a disease progression model, I personally recommend omitting the controls' z-score data.
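
A minimal sketch of that recommendation, assuming the z-scores are computed relative to the control group and that a boolean mask marks which rows are controls. The function name, array names, and mask are hypothetical and are not part of pySuStaIn:

```python
import numpy as np

def zscore_against_controls(data, is_control):
    """Z-score each biomarker (column) using the control group's mean and
    standard deviation, then return only the patient (non-control) rows."""
    data = np.asarray(data, dtype=float)
    is_control = np.asarray(is_control, dtype=bool)

    # Reference statistics come from the controls only (assumption).
    mu = data[is_control].mean(axis=0)
    sd = data[is_control].std(axis=0, ddof=1)

    z = (data - mu) / sd

    # Controls are used for normalisation but are not passed to the model.
    return z[~is_control]

# Toy example: 4 controls followed by 4 patients, 3 biomarkers.
rng = np.random.default_rng(0)
raw = rng.normal(size=(8, 3))
is_control = np.array([True] * 4 + [False] * 4)
z_patients = zscore_against_controls(raw, is_control)
print(z_patients.shape)
```

Only `z_patients` would then be passed on as the training data. If higher raw values mean "more normal" for a biomarker, the sign of its z-score may need flipping first; that step is not shown here.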

xullllllll commented 1 month ago

Thank you very much for your answer. However, the pySuStaIn tutorial notebook you mentioned doesn't explain in detail how to set z_vals according to the distribution of the biomarkers. Is z_vals = 1, 2, 3 suitable for all types of data? Also, can you explain the difference between input data that includes the control group and input data that does not?

noxtoby commented 1 month ago

  1. Z_vals. There is no prescribed way, but I would recommend checking the coverage of your data (e.g., if a biomarker never reaches a certain z-score, then don't include that score in Z_vals for that biomarker).
  2. I'm not sure what you need explaining here. With controls' data included in the sustain object, the algorithm will produce a progression subtype model that is influenced by data from controls. That data is by definition not disease data, so you will essentially be adding noise to the model. Why would you look at oranges if you're trying to understand apples?
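
The coverage check in item 1 could be sketched as follows. This is an illustration only: the function name, the candidate thresholds, and the `min_count` parameter are assumptions made for the example, not part of pySuStaIn.

```python
import numpy as np

def covered_z_vals(z_data, candidates=(1, 2, 3), min_count=1):
    """For each biomarker (column of z_data), keep only the candidate
    z-score thresholds that at least `min_count` subjects reach."""
    z_data = np.asarray(z_data, dtype=float)
    kept = []
    for j in range(z_data.shape[1]):
        col = z_data[:, j]
        # A threshold is "covered" if enough subjects are at or above it.
        kept.append([z for z in candidates if np.sum(col >= z) >= min_count])
    return kept

# Toy example: 3 subjects, 2 biomarkers (z-scores, patients only).
z = np.array([[0.5, 1.2],
              [2.5, 3.4],
              [1.1, 0.2]])
for j, vals in enumerate(covered_z_vals(z)):
    print(f"biomarker {j}: candidate thresholds with coverage = {vals}")
```

Raising `min_count` gives a stricter check, which may be sensible when only one or two subjects reach a threshold. How a different number of thresholds per biomarker should be passed to ZscoreSustain is not covered here; check the docstring in ZscoreSustain.py.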

xullllllll commented 1 month ago

Oh, I see. I still have a few questions about how to choose the value of Z_max. Should I follow the pySuStaIn tutorial notebook, which suggests 'choosing a value around the 95th percentile of your data', or the comment in ZscoreSustain.py, 'when using z-score thresholds of 1, 2, 3, Z_max would typically be 5'? If these two conflict, which should I choose? Must the value of Z_max be greater than the values in Z_vals?