mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Using phylogeny based distance vs mash distance for lineage effects #196

Closed by ss8222 2 years ago

ss8222 commented 2 years ago

Hello,

My goal is to identify lineage and locus effects associated with a numeric phenotype.

I was wondering whether it makes a difference if, for --dist, I use the distances from my Gubbins phylogeny (python scripts/phylogeny_distance.py core_genome_aln.tree > phylogeny_dists.tsv) rather than mash distances. Is one better than the other? (Of note, I am already using the Gubbins phylogeny for --similarity.)

I am providing custom lineage definitions (ST) with --lineage-clusters. I presume the output will indicate which MDS components are associated with the phenotype. Is there any way to tell which genomes correspond to each MDS component? I plan to make a tree and annotate it with the MDS information, so this would be useful.

A very basic question on interpreting beta: does a negative beta mean that the presence of that unitig is associated with a lower phenotype value? Taking this further, if I have a negative beta and look at the 'k-sample' column, should I expect those entries to have lower phenotype values than those in 'nk-sample' for that entry?

Thank you!

johnlees commented 2 years ago

> I was wondering whether it makes a difference if, for --dist, I use the distances from my Gubbins phylogeny (python scripts/phylogeny_distance.py core_genome_aln.tree > phylogeny_dists.tsv) rather than mash distances. Is one better than the other? (Of note, I am already using the Gubbins phylogeny for --similarity.)

I would suggest using the same tree to generate both distances (though in most cases I don't expect this to matter that much).
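To make "the same tree for both" concrete, here is a minimal sketch on a toy tree. It is not pyseer's code: the tree, tip names, and branch lengths are made up, and it assumes the similarity is defined as the branch length shared between the root and the most recent common ancestor (MRCA) of each pair, which may not match exactly what your --similarity matrix contains. The distance used is the usual patristic (tip-to-tip path length) distance.

```python
# Toy rooted tree stored as child -> (parent, branch length).
# All names and lengths are hypothetical, for illustration only.
TREE = {
    "A": ("AB", 0.1),
    "B": ("AB", 0.2),
    "AB": ("root", 0.3),
    "C": ("root", 0.5),
}
TIPS = ["A", "B", "C"]


def ancestors(node):
    """Return the path from a node up to the root, both included."""
    path = [node]
    while path[-1] in TREE:
        path.append(TREE[path[-1]][0])
    return path


def depth(node):
    """Sum of branch lengths between the root and a node."""
    total = 0.0
    while node in TREE:
        parent, length = TREE[node]
        total += length
        node = parent
    return total


def mrca(a, b):
    """Most recent common ancestor of two nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def patristic_distance(a, b):
    """Path length between two tips along the tree."""
    return depth(a) + depth(b) - 2 * depth(mrca(a, b))


def shared_branch_similarity(a, b):
    """Assumed similarity: branch length from the root to the pair's MRCA."""
    return depth(mrca(a, b))


for a in TIPS:
    for b in TIPS:
        print(a, b,
              round(patristic_distance(a, b), 3),
              round(shared_branch_similarity(a, b), 3))
```

Because both quantities are read off the same topology and branch lengths, closely related pairs get a small distance and a large similarity, so the two matrices describe the same population structure.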

> I am providing custom lineage definitions (ST) with --lineage-clusters. I presume the output will indicate which MDS components are associated with the phenotype. Is there any way to tell which genomes correspond to each MDS component? I plan to make a tree and annotate it with the MDS information, so this would be useful.

The output will tell you which ST each variant is most strongly associated with. You will also get a list of how strongly each ST is associated with the phenotype. As far as I remember, this won't use MDS, and each genome belongs to exactly one ST, so going from ST to genomes should be a simple lookup.
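For the tree annotation, the ST to genome lookup could be sketched roughly as below. This assumes the file passed to --lineage-clusters has two tab-separated columns (sample name, then cluster) with no header; the sample and ST names here are hypothetical, so adjust the parsing to whatever your file actually looks like.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the lineage clusters file (assumed layout: sample<TAB>cluster).
# Replace io.StringIO(...) with open("your_clusters_file") for real data.
clusters_text = "sample_1\tST131\nsample_2\tST131\nsample_3\tST73\n"


def read_lineages(handle):
    """Map each genome to its lineage label, skipping malformed lines."""
    genome_to_st = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        genome_to_st[row[0]] = row[1]
    return genome_to_st


genome_to_st = read_lineages(io.StringIO(clusters_text))

# Invert the mapping so each ST lists its member genomes.
st_to_genomes = defaultdict(list)
for genome, st in genome_to_st.items():
    st_to_genomes[st].append(genome)

for st, genomes in sorted(st_to_genomes.items()):
    print(st, ",".join(genomes))
```

The genome_to_st dictionary is the form most tree viewers want (one label per tip), while st_to_genomes is handy for checking which genomes sit behind an ST reported in the lineage effects output.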

> A very basic question on interpreting beta: does a negative beta mean that the presence of that unitig is associated with a lower phenotype value? Taking this further, if I have a negative beta and look at the 'k-sample' column, should I expect those entries to have lower phenotype values than those in 'nk-sample' for that entry?

Yes, that's right. However, if you have covariates, this won't necessarily hold if you just compare the raw means. Basically pyseer is just doing a regression, so any intuition you have from multivariate linear/logistic regressions will pretty much apply to pyseer too.
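To illustrate the caveat about covariates, here is a small self-contained sketch with made-up data. It is not pyseer's code, just an ordinary least-squares fit of phenotype on an intercept, a variant presence/absence column, and one binary covariate. The data are constructed so that the variant lowers the phenotype by 1 while the covariate raises it by 10, and the variant mostly occurs in samples that also have the covariate.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical samples as (variant present, covariate present).
samples = [(1, 1)] * 4 + [(1, 0)] + [(0, 1)] + [(0, 0)] * 4
# Phenotype built without noise: covariate adds 10, variant subtracts 1.
phenotype = [10.0 * cov - 1.0 * var for var, cov in samples]

design = [[1.0, float(var), float(cov)] for var, cov in samples]
intercept, beta_variant, beta_covariate = ols(design, phenotype)

with_variant = [y for (var, _), y in zip(samples, phenotype) if var == 1]
without_variant = [y for (var, _), y in zip(samples, phenotype) if var == 0]
mean_k = sum(with_variant) / len(with_variant)
mean_nk = sum(without_variant) / len(without_variant)

print("beta for variant:", round(beta_variant, 6))
print("raw mean with variant:", mean_k)
print("raw mean without variant:", mean_nk)
```

Here the fitted beta for the variant is negative even though the samples carrying the variant have the higher raw mean, because the comparison of means ignores the covariate. Without covariates (or other adjustment terms), the sign of beta and the difference in means would agree.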