mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Bad ChiSq Only for isolates #198

Closed: noahaus closed this issue 2 years ago

noahaus commented 2 years ago

Greetings!

I am able to run pyseer successfully on a group of isolates that are segregated by country of origin, using this command:

pyseer --phenotypes country_data.pheno --vcf mbov_combine.vcf --distances mash.tsv --min-af 0.02 --max-af 0.98 --max-dimensions 10

A note about my phenotype file: the segregation between isolates from different countries is perfect, with no mixing of isolates. This is shown by the phylogeny I produced, where the second bar represents country of origin:

[image: phylogeny with a second bar showing country of origin]

The pyseer output is below; every position reported carries the bad-chisq note:

Read 700 phenotypes
Detected binary phenotype
Structure matrix has dimension (700, 700)
Analysing 700 samples found in both phenotype and structure matrix
Perfectly separable data error for null model
Could not fit null model, exiting
(pyseer) noahaus@ss-sub4 snp_gwas$ pyseer --phenotypes country_data.pheno --vcf mbov_combine.vcf --distances mash.tsv --min-af 0.02 --max-af 0.98 --max-dimensions 10  
Read 700 phenotypes
Detected binary phenotype
Structure matrix has dimension (700, 700)
Analysing 700 samples found in both phenotype and structure matrix
variant af  filter-pvalue   lrt-pvalue  beta    beta-std-err    intercept   PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10    notes
LT708304.1_4039_C_T 2.37E-01    1.86E-12    1.00E+00    -9.11E+00   1.35E+00    -1.29E+00   9.69E+00    1.20E+01    5.16E+00    -2.14E+00   -3.77E+00   8.36E-02    -1.19E+00   -8.14E-03   2.19E-01    2.01E+00    bad-chisq
LT708304.1_4480_T_C 4.46E-01    4.35E-45    1.00E+00    1.53E+01    1.78E+00    -1.04E+01   -1.32E+01   3.55E+00    7.56E+00    -1.89E+00   5.59E+00    5.30E+00    -6.74E-01   2.59E+00    2.28E-01    -5.02E-01   bad-chisq
LT708304.1_8741_T_C 4.64E-01    5.84E-42    1.00E+00    1.29E+01    1.92E+00    -9.52E+00   -1.05E+01   5.97E+00    7.98E+00    -1.87E+00   4.07E+00    -5.96E+00   7.89E-01    -2.80E+00   2.49E+00    -3.68E-01   bad-chisq
LT708304.1_12804_C_G    3.57E-02    1.50E-02    1.00E+00    1.83E+00    6.88E-01    -3.36E+00   1.55E+00    1.88E+01    3.30E+00    -2.90E+00   -1.07E+00   7.49E-01    -1.14E+00   3.07E-01    6.79E-01    1.15E+00    bad-chisq
LT708304.1_13268_C_T    2.29E-02    5.33E-02    1.00E+00    6.93E+00    9.04E-01    -3.52E+00   1.70E+00    1.84E+01    1.02E+01    -2.60E+00   -9.78E-01   1.86E+00    -9.60E-01   7.01E-01    1.09E+00    9.62E-01    bad-chisq
LT708304.1_16939_G_T    3.29E-02    1.99E-02    1.00E+00    7.25E-01    7.21E-01    -3.38E+00   1.76E+00    1.88E+01    3.34E+00    -2.84E+00   -1.49E+00   7.98E-01    -1.21E+00   1.71E-01    3.92E-01    1.36E+00    bad-chisq
LT708304.1_23501_C_T    2.29E-02    5.33E-02    1.00E+00    6.93E+00    9.04E-01    -3.52E+00   1.70E+00    1.84E+01    1.02E+01    -2.60E+00   -9.78E-01   1.86E+00    -9.60E-01   7.01E-01    1.09E+00    9.62E-01    bad-chisq
LT708304.1_27724_T_C    2.37E-01    1.86E-12    1.00E+00    -9.11E+00   1.35E+00    -1.29E+00   9.69E+00    1.20E+01    5.16E+00    -2.14E+00   -3.77E+00   8.36E-02    -1.19E+00   -8.14E-03   2.19E-01    2.01E+00    bad-chisq
LT708304.1_27911_C_T    2.37E-01    1.86E-12    1.00E+00    -9.11E+00   1.35E+00    -1.29E+00   9.69E+00    1.20E+01    5.16E+00    -2.14E+00   -3.77E+00   8.36E-02    -1.19E+00   -8.14E-03   2.19E-01    2.01E+00    bad-chisq
LT708304.1_27960_A_G    2.29E-02    5.33E-02    1.00E+00    6.93E+00    9.04E-01    -3.52E+00   1.70E+00    1.84E+01    1.02E+01    -2.60E+00   -9.78E-01   1.86E+00    -9.60E-01   7.01E-01    1.09E+00    9.62E-01    bad-chisq
LT708304.1_29061_C_T    1.86E-01    2.99E-154   1.00E+00    2.05E+01    9.31E-01    -7.49E+00   -4.72E-01   -1.50E+01   -5.81E+07.02E-01    -1.00E+00   3.65E+00    1.55E+00    3.71E+00    3.17E+00    -1.48E-01   bad-chisq
LT708304.1_29124_G_A    3.14E-02    2.28E-02    1.00E+00    2.35E+00    6.99E-01    -3.43E+00   1.59E+00    1.91E+01    3.40E+00    -3.17E+00   -1.06E+00   5.49E-01    -8.64E-01   8.72E-01    -2.86E-02   1.79E+00    bad-chisq
LT708304.1_33426_G
...

I wanted to ask whether I'm doing something incorrectly, or whether this is just a consequence of the input data being too well segregated. Any advice would be greatly appreciated!
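
For context on the "Perfectly separable data error for null model" message in the log above, here is a toy sketch of why perfect separation is a problem for logistic regression in general. It is not pyseer's code: the one-covariate, no-intercept model and the six data points are made up for illustration. When a covariate predicts a binary phenotype perfectly, the log-likelihood keeps improving as the coefficient grows, so no finite maximum-likelihood estimate exists.

```python
import numpy as np


def log_likelihood(beta, x, y):
    """Bernoulli log-likelihood of a no-intercept logistic model y ~ sigmoid(beta * x)."""
    z = beta * x
    # log p(y | x) = y*z - log(1 + exp(z)); logaddexp keeps this numerically stable.
    return float(np.sum(y * z - np.logaddexp(0.0, z)))


# Hypothetical toy data: the sign of x perfectly predicts the binary phenotype y.
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

# The log-likelihood climbs towards 0 as beta grows, without ever peaking.
betas = [1.0, 5.0, 25.0, 125.0]
lls = [log_likelihood(b, x, y) for b in betas]
for b, ll in zip(betas, lls):
    print(f"beta={b:6.1f}  log-likelihood={ll:.3e}")
```

A fitting routine in this situation either diverges or reports an error, which would be consistent with the first run in the log exiting before testing any variant.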

johnlees commented 2 years ago

Hi Noah,

Yes, this is probably expected with such strong segregation of the phenotype, so I'd be careful with the results: the associated variants are very likely just those on the branches leading to each lineage. Still, you might like to try the LMM instead; see https://pyseer.readthedocs.io/en/master/best_practices.html and the usage guide. You'll want to swap your distances for a similarity matrix (which you can generate from the phylogeny above) and add --lmm.
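
As a rough sketch of that change, the snippet below assembles the adjusted command line in Python without running it. The file name phylogeny_similarity.tsv is a hypothetical placeholder for a similarity matrix generated from the phylogeny; the other file names and the allele-frequency thresholds are taken from the original command. It assumes --distances and --max-dimensions are only needed for the fixed-effects model and so drops them; the usage guide is the place to confirm that.

```python
import shlex


def build_lmm_command(phenotypes, vcf, similarity, min_af=0.02, max_af=0.98):
    """Return a pyseer argument list that uses the LMM with a similarity matrix."""
    return [
        "pyseer",
        "--lmm",                      # switch to the linear mixed model
        "--phenotypes", phenotypes,
        "--vcf", vcf,
        "--similarity", similarity,   # replaces --distances from the original run
        "--min-af", str(min_af),
        "--max-af", str(max_af),
    ]


cmd = build_lmm_command(
    "country_data.pheno",
    "mbov_combine.vcf",
    "phylogeny_similarity.tsv",  # hypothetical file name, not from the thread
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell once the similarity file exists.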