mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
109 stars 27 forks source link

How to deal with linkage in the data #192

Closed martinastoycheva closed 2 years ago

martinastoycheva commented 2 years ago

Hello,

I am having problems with controlling for the population stuctre in my sample set and I was wodnering if I can get some help? I am not sure when the QQplots are good enough in order to be able to trust the p values.

I have ran pyseer with gene presence file and a whole genome vcf. I am using a very good quality hgt free phylogeny and I have used the following command:

pyseer --phenotypes ${3} \
--vcf ${1} \
--similarity ../phylogeny_similarity_rooted_clonalML.tsv \
--lmm \
--cpu 8  > results/${2} 

I was wondering if I need to do any LD pruning prior to runing pyseer? I am asking because I noticed that if I do some LD pruning with bcftools (e.g r2=0.8, window size=1000) I can get the QQplots to lool very good:

image

as opposed to bad before pruning:

image

However, this approach removes most of the SNPs and makes the results very hard to interpret, and I am overall not very confident it is very good. Therefore, I was wondering if there is a standard way to overcome this issue?

I am also wondering how this could be extended to unitigs and accessory genes? As my gene presence absence QQplots don't look that great either:

image

Cheers, Martina

johnlees commented 2 years ago

This looks pretty typical for a bacterial GWAS to me, I can't see anywhere you're going wrong. Sometimes/often the population structure/phenotype stratification in the samples is impossible to completely 'remove'. It can be useful to draw the phenotype on the tree to have a look at this. But the standard significance threshold approach might not work too well with these results, and I would generally advise seeing this as a ranking of most likely associations.

It's best not to prune, as you will lose signal that way (without substantively changing the statistical test).

I would use exactly the same approach for genes and unitigs.

martinastoycheva commented 2 years ago

Thanks very much for the quick response! If the p values are a bit unrealible in my case, would you advice to try the whole genome lasso model instead and not use p values?

johnlees commented 2 years ago

You can certainly try this as an alternative, though it is more focused on prediction, rather than finding causal variants.