mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Dealing with intra-patient diversity -- covariates with elastic net model? #159

cizydorczyk closed this issue 2 years ago

cizydorczyk commented 3 years ago

Hello,

I have a scenario where I have multiple isolates per patient and was wondering how best to incorporate this into a GWAS analysis. Intuitively, it seems that presence/absence of a variant in an isolate will not correspond to presence/absence in a patient, yet my phenotype is defined per patient (i.e. the patient does/does not have disease). I came across a recent microbial GWAS review (San et al. 2020 https://doi.org/10.3389/fmicb.2019.03119) that suggests including such "intra-patient diversity" as covariates in an analysis, and recommends pyseer as one option that allows covariates.

Would such an approach make sense, or does it grossly violate assumptions of a GWAS? (broadly speaking)

Second, is it possible to include a covariates file when using the elastic net model? I cannot find anything in the documentation that states whether this is possible. I only found a reference to the lineage clusters option --lineage-clusters, but I do not think this is what I am looking for.

Any help in understanding is greatly appreciated.

Thank you, Conrad

mgalardini commented 3 years ago

To answer your second question, you can definitely use covariates with the elastic net model.

Regarding the first one, I am a bit confused by your statement "it seems that presence/absence of a variant in an isolate will not correspond to presence/absence in a patient". Could you please clarify this a bit?

One thing you could try is to encode the presence of multiple isolates in a patient as a binary variable in your covariates matrix, though I may have misunderstood your specific dataset.
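As a rough illustration of that suggestion, the sketch below derives a binary "multi_isolate" covariate from an isolate-to-patient mapping and writes it as a tab-separated table. The isolate and patient names, the column names, and the assumption that the covariates file is tab-separated with sample names in the first column are all illustrative; check the pyseer documentation for the exact format it expects.

```python
import csv
import io
from collections import Counter

# Hypothetical isolate -> patient mapping (names are illustrative only).
isolate_to_patient = {
    "iso1": "patientA",
    "iso2": "patientA",
    "iso3": "patientB",
    "iso4": "patientC",
    "iso5": "patientC",
}

FIELDS = ["sample", "patient", "multi_isolate"]


def build_covariate_rows(mapping):
    """Return one row per isolate, flagging patients with more than one isolate."""
    isolates_per_patient = Counter(mapping.values())
    rows = []
    for isolate, patient in sorted(mapping.items()):
        rows.append({
            "sample": isolate,
            "patient": patient,
            # 1 if this isolate's patient contributed several isolates, else 0
            "multi_isolate": int(isolates_per_patient[patient] > 1),
        })
    return rows


def write_covariates(rows, handle):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


rows = build_covariate_rows(isolate_to_patient)
buffer = io.StringIO()  # swap for open("covariates.tsv", "w") to write a file
write_covariates(rows, buffer)
print(buffer.getvalue())
```

The patient column is kept alongside the binary flag so that the same table could also be used if the patient identifier itself is later treated as a categorical covariate.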

cizydorczyk commented 3 years ago

Thank you for your quick response!

What I meant is this: suppose that in patient A we have isolates 1-10, and (in truth) isolates 1-7 carry a causal variant while isolates 8-10 do not. The phenotype I am working with is derived from the patient, not from individual isolates (i.e. the patient from whom the isolates were obtained does/does not have disease). Would it not pose a problem that I am assigning the same phenotype to all 10 isolates, even though some carry the causal variant and others do not?

Perhaps I am the one who is confused here. Admittedly, I am not entirely certain what effect multiple isolates per patient have on a GWAS, other than potentially introducing further population structure.
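To make the patient A scenario concrete, here is a minimal sketch with made-up data: ten isolates from one diseased patient, of which the first seven carry the variant. It copies the patient-level phenotype onto every isolate, lists the isolates whose label disagrees with their genotype, and computes the per-patient carrier fraction. All names and values are hypothetical.

```python
# Hypothetical data: patient A has isolates A1-A10; only A1-A7 carry the variant.
isolates = [
    {"isolate": f"A{i}", "patient": "A", "variant": int(i <= 7)}
    for i in range(1, 11)
]
patient_phenotype = {"A": 1}  # 1 = patient has disease


def label_isolates(records, phenotypes):
    """Copy the patient-level phenotype onto every isolate from that patient."""
    return [dict(rec, phenotype=phenotypes[rec["patient"]]) for rec in records]


def carrier_fraction(records):
    """Fraction of each patient's isolates that carry the variant."""
    totals, carriers = {}, {}
    for rec in records:
        totals[rec["patient"]] = totals.get(rec["patient"], 0) + 1
        carriers[rec["patient"]] = carriers.get(rec["patient"], 0) + rec["variant"]
    return {patient: carriers[patient] / totals[patient] for patient in totals}


labelled = label_isolates(isolates, patient_phenotype)

# Isolates labelled "disease" that do not carry the variant.
discordant = [r["isolate"] for r in labelled
              if r["phenotype"] == 1 and r["variant"] == 0]
fractions = carrier_fraction(isolates)

print(discordant)  # ['A8', 'A9', 'A10']
print(fractions)   # {'A': 0.7}
```

In this toy case three of the ten isolate-level labels do not match the genotype, which is the mismatch described above.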

Thank you, Conrad

mgalardini commented 3 years ago

So, if I understood correctly, you have genome sequences of the individual isolates and a phenotype that is per-patient, so you cannot attach a "true" phenotype to each isolate. I agree that this makes the analysis tricky, as in principle each sample can be assigned more than one phenotype, assuming you observe certain isolates in multiple patients that have different phenotypes.

johnlees commented 3 years ago

One thing you may wish to try is modelling the patient identifier as a random effect (especially if the number of covariates for patients is large). We don't support this in pyseer, but you can use a linear mixed model package such as lme4 to fit these models (with some care to model the genetic relatedness matrix in the same way as in pyseer/limix), or I think you could do it in a general Bayesian inference package such as Stan.
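One possible way to write down such a model, as a sketch only: the notation below is an assumption introduced for illustration and is not taken from pyseer, limix or lme4. It is written for a continuous phenotype; a binary phenotype would need a link function (e.g. logit) around the linear predictor.

$$
y_{ij} = \mu + \beta\, x_{ij} + \mathbf{c}_{ij}^{\top}\boldsymbol{\gamma} + g_{ij} + u_i + \varepsilon_{ij}
$$

$$
\mathbf{g} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_g^{2} K\right), \qquad
u_i \sim \mathcal{N}\!\left(0,\, \sigma_p^{2}\right), \qquad
\varepsilon_{ij} \sim \mathcal{N}\!\left(0,\, \sigma_e^{2}\right)
$$

Here $y_{ij}$ is the phenotype assigned to isolate $j$ from patient $i$, $x_{ij}$ is the presence/absence of the tested variant, $\mathbf{c}_{ij}$ holds any fixed-effect covariates, $\mathbf{g}$ is the genetic random effect with relatedness matrix $K$, and $u_i$ is the per-patient random intercept.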

mgalardini commented 2 years ago

Closing for lack of follow-up discussion