mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
104 stars 25 forks source link

Perfectly separable data error for null model #189

Closed nicola-palmieri closed 2 years ago

nicola-palmieri commented 2 years ago

Dear Author

I have started playing with pyseer on a small dataset of 12 strains from a bacterium called Gallibacterium anatis, from which I have computed SNPs using parsnp. In addition I have a table with binary values for resistance to different antibiotics, I am trying to fit the first antibiotic phenotypic binary vector to the SNPs, but I get the following error:

pyseer --phenotypes CEZ.pheno --vcf parsnp_filt.vcf.gz --distances mash.tsv --lineage --print-samples > CEZ_SNPs.txt
Read 12 phenotypes
Detected binary phenotype
Structure matrix has dimension (12, 12)
Population MDS scaling restricted to 6 dimensions instead of requested 10
Analysing 12 samples found in both phenotype and structure matrix
Perfectly separable data error for null model
Could not fit null model, exiting

These are the first 10 lines of the VCF file - it contains 75767 SNPs:

CHROM   POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  51_liver_2      4_ovar_2        24_ovar_1       51_ovar_1       24_heart_2      4_liver_2       51_liver_1      4_ovar_1        24_ovar_2       51_ovar_2       24_heart_1      4_liver_1
CP002667        231     CTGAATTTGC.CGCACAAGTT   C       A       40      PASS    CDS=UMN179_00002;AAR=A;AAA=A;SYN        GT      1       1       0       1       0       1       1       1       0       1       0       1
CP002667        235     ATTTGCCGCA.CAAGTTAATA   C       G       40      PASS    CDS=UMN179_00002;AAR=Q;AAA=E    GT      1       1       0       1       0       1       1       1       0       1       0       1
CP002667        236     TTTGCCGCAC.AAGTTAATAT   A       G       40      PASS    CDS=UMN179_00002;AAR=Q;AAA=R    GT      1       0       0       1       0       0       1       0       0       1       0       0
CP002667        358     AGATAATTAC.AGAGCTTATC   A       C       40      PASS    CDS=UMN179_00002;AAR=R;AAA=R;SYN        GT      1       1       1       1       1       1       1       1       1       1       1       1
CP002667        393     TTAATGAAAT.CGGTATCCAA   C       T       40      PASS    CDS=UMN179_00002;AAR=I;AAA=I;SYN        GT      0       0       1       0       1       0       0       0       1       0       1       0
CP002667        817     AATGTGGAGA.GTTGAGGATA   G       A       40      PASS    CDS=UMN179_00002;AAR=V;AAA=I    GT      0       0       1       0       1       0       0       0       1       0       1       0
CP002667        818     ATGTGGAGAG.TTGAGGATAT   T       C       40      PASS    CDS=UMN179_00002;AAR=V;AAA=A    GT      0       0       1       0       1       0       0       0       1       0       1       0
CP002667        869     GATAATTTTG.TGCAAAGAGA   T       C       40      PASS    CDS=UMN179_00002;AAR=V;AAA=A    GT      0       1       1       0       1       1       0       1       1       0       1       1
CP002667        2432    CAATGGCAGG.TGGAATTACA   T       C       40      PASS    CDS=UMN179_00005;AAR=G;AAA=G;SYN        GT      1       1       1       1       1       1       1       1       1       1       1       1
CP002667        2526    AGACAATAGT.GTTCCTGCTA   G       A       40      PASS    CDS=UMN179_00005;AAR=V;AAA=I    GT      1       1       1       1       1       1       1       1       1       1       1       1
CP002667        2550    AGGCACACGA.GAAAAATATG   G       A       40      PASS    CDS=UMN179_00005;AAR=E;AAA=K    GT      1       1       1       1       1       1       1       1       1       1       1       1

Here is the phenotype file for the first antibiotic (CEZ = cefazoline):

samples CEZ
24_heart_1      1
24_heart_2      1
24_ovar_1       1
24_ovar_2       1
4_liver_1       1
4_liver_2       1
4_ovar_1        0
4_ovar_2        0
51_liver_1      1
51_liver_2      1
51_ovar_1       1
51_ovar_2       1

This is the mash distance table for the 12 strains:

       24_heart_1      24_heart_2      24_ovar_1       24_ovar_2       4_liver_1       4_liver_2       4_ovar_1        4_ovar_2        51_liver_1      51_liver_2      51_ovar_1       51_ovar_2
24_heart_1      0.0     0.000429547     0.000107506     0.000105109     0.0229527       0.0229379       0.0228204       0.0229011       0.0219701       0.0221764       0.0220268       0.0219842
24_heart_2      0.000429547     0.0     0.000500634     0.000488355     0.0231301       0.0230931       0.0229453       0.0230708       0.0218781       0.0221621       0.021963        0.0219701
24_ovar_1       0.000107506     0.000500634     0.0     4.05279e-05     0.0229232       0.0229011       0.0227764       0.0228497       0.0219347       0.0221692       0.0220126       0.0219842
24_ovar_2       0.000105109     0.000488355     4.05279e-05     0.0     0.0229232       0.0229306       0.0227984       0.0228424       0.0219064       0.0221692       0.0220055       0.0219701
4_liver_1       0.0229527       0.0231301       0.0229232       0.0229232       0.0     6.91981e-05     0.000353903     0.000189217     0.0241291       0.0243306       0.02416 0.024106
4_liver_2       0.0229379       0.0230931       0.0229011       0.0229306       6.91981e-05     0.0     0.000336872     0.000215742     0.024106        0.0243073       0.0241291       0.0240906
4_ovar_1        0.0228204       0.0229453       0.0227764       0.0227984       0.000353903     0.000336872     0.0     0.00022057      0.0239829       0.02416 0.0240444       0.0240213
4_ovar_2        0.0229011       0.0230708       0.0228497       0.0228424       0.000189217     0.000215742     0.00022057      0.0     0.0241368       0.0243306       0.0241832       0.0241291
51_liver_1      0.0219701       0.0218781       0.0219347       0.0219064       0.0241291       0.024106        0.0239829       0.0241368       0.0     0.00033444      0.00016514      0.00015552
51_liver_2      0.0221764       0.0221621       0.0221692       0.0221692       0.0243306       0.0243073       0.02416 0.0243306       0.00033444      0.0     0.000317429     0.000254404
51_ovar_1       0.0220268       0.021963        0.0220126       0.0220055       0.02416 0.0241291       0.0240444       0.0241832       0.00016514      0.000317429     0.0     0.000148309
51_ovar_2       0.0219842       0.0219701       0.0219842       0.0219701       0.024106        0.0240906       0.0240213       0.0241291       0.00015552      0.000254404     0.000148309     0.0

What could be the problem?

Thank you!!

Kind regards Nicola

mgalardini commented 2 years ago

I am afraid your sample size is simply too small for pyseer to be used, let alone return meaningful results. The original SEER paper has some comments about sample size.

nicola-palmieri commented 2 years ago

Thanks for the prompt reply. This was a test dataset but I am planning to sequence more strains. I thought I could already see some associations. I will retry when I have more data.