mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Question: how to identify which phenotype a variant/kmer is associated with? #127

Closed reedus-123 closed 3 years ago

reedus-123 commented 3 years ago

This isn't an issue but rather a query (apologies if it's in the wrong place).

Having followed the GWAS pipeline, once significant variants/kmers are identified, how can one go about identifying which phenotype they are positively associated with?

Thanks in advance, and thanks for creating this software.

mgalardini commented 3 years ago

Hi, could you maybe clarify your question a bit more? The phenotype that variants are associated with is the one you provide through the --phenotypes and --phenotype-column command line arguments. The latter is only required if your phenotype is not encoded in the last column. Hope this helps.

reedus-123 commented 3 years ago

Thank you for your explanation. What I would like to know is if you have two phenotypes and you find that there is a significant association between variant and phenotype, is there a way to find out which of the two phenotypes the variant is positively associated with? (i.e., found more frequently)

So for instance, if I was comparing two sets of samples (e.g., from cattle and from soil) and I found that there were a number of significant variants, how would I know if the presence of these variants is significantly associated with cattle or soil? Also, does the software support the comparison between 3 groups (e.g., cattle, soil and water)?

mgalardini commented 3 years ago

If you want to do associations against a discrete phenotype with multiple classes (> 2), then I'd suggest using something called dummy encoding, or more specifically one-hot encoding. This way you will have 3 phenotypes in your last example (e.g. cattle vs. the rest, soil vs. the rest, and water vs. the rest), and you can run three separate associations. Does that make sense?
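As a rough illustration of the one-hot encoding described above, here is a minimal Python sketch that turns a single multi-class label column into one 0/1 column per class. The input layout (a tab-separated file with `sample` and `phenotype` columns), the helper name `one_hot_phenotypes`, and the example labels are assumptions made for illustration; they are not taken from pyseer itself.

```python
import csv
import io


def one_hot_phenotypes(rows, label_column="phenotype"):
    """Turn one multi-class label column into one 0/1 column per class."""
    classes = sorted({row[label_column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {"sample": row["sample"]}
        for cls in classes:
            # 1 if the sample belongs to this class, 0 for "the rest"
            new_row[cls] = 1 if row[label_column] == cls else 0
        encoded.append(new_row)
    return classes, encoded


# Hypothetical input: tab-separated, one source label per sample.
raw = "sample\tphenotype\na.fasta\tcattle\nb.fasta\tsoil\nc.fasta\twater\n"
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
classes, encoded = one_hot_phenotypes(rows)

# Write the encoded table as tab-separated text (here to a string buffer).
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["sample"] + classes, delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(encoded)
print(out.getvalue(), end="")
```

Each resulting column could then be used as the phenotype for its own run, selected through the --phenotype-column argument mentioned earlier in the thread.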

reedus-123 commented 3 years ago

Yes, it does. So in the phenotypes.pheno file, the structure would be:

sample phenotype
a.fasta 1
b.fasta 2
c.fasta 3

where 1, 2 and 3 are cattle, soil and water respectively.

But my main question is - once I've run the pipeline and concluded that there is a significant association between a variant and phenotype, what steps can I take to figure out which of the three phenotypes the variant is associated with? i.e., is it associated with cattle, soil or water?

Thanks again for answering my questions, I really appreciate it.

mgalardini commented 3 years ago

Hi,

what I meant is something like this:

sample cattle soil water 
a.fasta 1 0 0
b.fasta 0 1 0
c.fasta 0 0 1

reedus-123 commented 3 years ago

Oh, perfect, thank you. So if I got this right, the variants identified as significant would be associated with the samples marked as a '1' and not a '0' in their respective column? Is that correct?

mgalardini commented 3 years ago

Yes, that is correct
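One simple way to sanity-check the direction for a single significant variant is to compare how often it occurs in the samples coded 1 versus those coded 0. The sketch below assumes you already have the variant's presence/absence per sample as a dictionary; the function name `carrier_frequency_by_group` and all data are made up for illustration. Raw frequencies ignore any population-structure correction applied by the association model, so treat this as a rough check rather than a replacement for the reported effect size.

```python
def carrier_frequency_by_group(presence, phenotype):
    """Return the fraction of samples carrying the variant in each phenotype group."""
    counts = {}
    for sample, label in phenotype.items():
        carried, total = counts.get(label, (0, 0))
        # Samples missing from `presence` are treated as not carrying the variant.
        counts[label] = (carried + presence.get(sample, 0), total + 1)
    return {label: carried / total for label, (carried, total) in counts.items()}


# Made-up example: phenotype 1 = cattle, 0 = the rest.
phenotype = {"a.fasta": 1, "b.fasta": 0, "c.fasta": 0, "d.fasta": 1}
# Made-up presence (1) / absence (0) of one variant in each sample.
presence = {"a.fasta": 1, "b.fasta": 0, "c.fasta": 1, "d.fasta": 1}

freq = carrier_frequency_by_group(presence, phenotype)
print(freq)
```

In this toy data the variant is carried by every sample coded 1 and by half of the samples coded 0, which would point towards the group coded 1.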

reedus-123 commented 3 years ago

Sorry about all the questions, thank you for all your help!

iaposto commented 2 years ago

Hello,

I would like to ask a follow-up question: after running a GWAS (--lmm mode, --pres), I see that a lot of significant genes are homogeneously distributed among the 0 and 1 phenotypes or even skewed towards 0. This is difficult to interpret given that the significant genes should be associated with the 1 phenotype. Does it make sense to keep only those with β>0?

Thanks a lot!

johnlees commented 2 years ago

This depends on your phenotype, but generally I would say you should keep them all!

iaposto commented 2 years ago

Thanks for the advice! The phenotype is commensal (0) vs. pathogenic (1).
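In line with the advice to keep all hits, one option is to annotate each significant result with a direction based on the sign of its effect size instead of filtering on it. The sketch below assumes a tab-separated results table with columns named `variant` and `beta`, and assumes that a positive beta means presence of the variant goes with the phenotype coded 1; the column names, the helper `label_direction`, and the data are illustrative assumptions, so check them against your own output.

```python
import csv
import io


def label_direction(rows, beta_column="beta"):
    """Add a 'direction' field from the sign of beta, without dropping any row."""
    labeled = []
    for row in rows:
        beta = float(row[beta_column])
        if beta > 0:
            direction = "pathogenic (1)"
        elif beta < 0:
            direction = "commensal (0)"
        else:
            direction = "none"
        labeled.append({**row, "direction": direction})
    return labeled


# Made-up results table; column names are assumed, not guaranteed.
raw = "variant\tbeta\ngeneA\t0.82\ngeneB\t-0.41\ngeneC\t0.05\n"
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
labeled = label_direction(rows)

for row in labeled:
    print(row["variant"], row["beta"], row["direction"], sep="\t")
```

Hits with a negative beta stay in the table and are simply marked as pointing towards the group coded 0.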

