mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Possible problem with VCF file meaning 0 tested/printed variants #260

Closed michhulin closed 5 months ago

michhulin commented 5 months ago

Hi,

I'm trying to run pyseer with a VCF file. Below are my command and its output. I think there may be a problem with my input VCF file, as it appears that none of the variants are passed on to be tested by the program. However, I'm unsure what is wrong.

Many thanks!

pyseer --phenotypes psp_pheno --vcf out.vcf --distances dist --min-af 0.01 --max-af 0.99 --cpu 15 --filter-pvalue 1E-8 > pyseer.assoc-snp2

Read 55 phenotypes
Detected binary phenotype
Structure matrix has dimension (55, 55)
Analysing 55 samples found in both phenotype and structure matrix
23018 loaded variants
23018 pre-filtered variants
0 tested variants
0 printed variants

This is the VCF format:

fileformat=VCFv4.0

FILTER=

Reference genome=GCF_000012205

INFO=

INFO=

bcftools_normVersion=1.9-64-g28bcc56+htslib-1.9-52-g6e86e38

bcftools_normCommand=norm -m - VCF.GCF_000012205.vcf; Date=Fri Jan 19 14:44:29 2024

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0886-19 1149B 1302A 1390 82-HI B_21-22 B_21-3 B_21-60 B_21-69 B_21-76 B_21-79 B_21-8 B_22-1 B_22-3 B_22-4 B_22-5 B_22-8 GCF_000012205 GCF_001294035 GCF_001294065 GCF_001294105 GCF_001294265 GCF_001400605 GCF_003412855 GCF_003412865 GCF_003412905 GCF_003412915 GCF_003412965 GCF_003412985 GCF_003412995 GCF_003413015 GCF_003413035 GCF_003413075 GCF_003413095 GCF_003413115 GCF_003413145 GCF_003413155 GCF_003413175 GCF_003413195 GCF_003413225 GCF_003413235 GCF_003413245 GCF_003413305 GCF_003413345 GCF_003413365 GCF_003413375 GCF_003413385 GCF_003413425 GCF_003413445 GCF_003700295 GCF_003701825 GCF_003703035 R12-ID R2NY R2QHB

1 1742643 AAAAAAAA.CCCCGCCT_F G C . . NS=47;AF=0.018 GT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . 0 . . 0 . 0 0 0 0 0 1 . 0 0 0 0 0 0 0 0 0 . 0 0 . . 0 0 0 0 0 0 0

johnlees commented 5 months ago

The usual cause of this is that sample names do not match between VCF and phenotypes.
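A quick way to check for a sample-name mismatch is to compare the names on the VCF `#CHROM` header line with the first column of the phenotype file. The sketch below assumes a tab-delimited phenotype file with one header row and sample names in the first column; the helper names and the toy inputs (including the phenotype values) are made up for illustration and are not part of pyseer.

```python
import io


def vcf_samples(handle):
    """Return the sample names listed on the #CHROM header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            # The first nine columns are fixed VCF fields; sample names follow.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def phenotype_samples(handle):
    """Return sample names from the first column, skipping the header row.

    Assumption: tab-delimited file with a single header line.
    """
    next(handle, None)  # skip the header row
    return [line.split("\t")[0].strip() for line in handle if line.strip()]


def compare_samples(vcf_names, pheno_names):
    """Report which names are shared and which appear in only one file."""
    vcf_set, pheno_set = set(vcf_names), set(pheno_names)
    return {
        "shared": sorted(vcf_set & pheno_set),
        "vcf_only": sorted(vcf_set - pheno_set),
        "pheno_only": sorted(pheno_set - vcf_set),
    }


# Toy inputs for illustration only (phenotype values are invented).
vcf_text = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t0886-19\t1149B\tR2NY\n"
)
pheno_text = "samples\tphenotype\n0886-19\t1\n1149B\t0\nR2QHB\t1\n"

report = compare_samples(
    vcf_samples(io.StringIO(vcf_text)),
    phenotype_samples(io.StringIO(pheno_text)),
)
print(report)
```

With real files, replace the `io.StringIO` objects with `open("out.vcf")` and `open("psp_pheno")`. Any names listed under `vcf_only` or `pheno_only` are samples that would not be matched between the two files.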

You may also not have enough observations; 55 samples is small. The variant shown above has a frequency of 1.8%, so it should be included. But really, with this many samples you should change the range to at least 5% to 95%, because with the current filters you are including singletons.
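To make the singleton point concrete, here is a rough sketch of the arithmetic. It recomputes the frequency of the variant shown above from its genotype columns and works out how many carriers are needed to reach a given minimum frequency with 55 samples. It assumes the frequency is the number of alternate calls divided by all samples (which reproduces the `AF=0.018` in the INFO field) and that the threshold is inclusive; pyseer's own filter may handle missing calls or boundaries differently.

```python
import math

# Genotype columns of the variant at position 1742643, copied from the VCF line above.
GENOTYPES = (
    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . 0 . . 0 . 0 0 0 0 0 1 . "
    "0 0 0 0 0 0 0 0 0 . 0 0 . . 0 0 0 0 0 0 0"
).split()


def allele_frequency(genotypes):
    """Alternate-allele carriers divided by all samples (missing calls included)."""
    return genotypes.count("1") / len(genotypes)


def min_carriers(n_samples, min_af):
    """Smallest carrier count whose frequency is at least min_af (inclusive)."""
    return max(1, math.ceil(min_af * n_samples))


n = len(GENOTYPES)                       # 55 samples
called = n - GENOTYPES.count(".")        # 47 non-missing calls, matching NS=47
af = allele_frequency(GENOTYPES)         # 1 / 55

print(f"samples={n} called={called} af={af:.3f}")
print("carriers needed at --min-af 0.01:", min_carriers(n, 0.01))
print("carriers needed at --min-af 0.05:", min_carriers(n, 0.05))
```

Under these assumptions a single carrier (1/55, about 1.8%) already clears a 1% threshold, whereas a 5% threshold requires at least three carriers.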

I would also suggest removing the --filter-pvalue option.
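If a p-value cutoff is still wanted, one option is to run without `--filter-pvalue` and filter the results afterwards. The sketch below assumes the output is tab-delimited with a header row and a column named `lrt-pvalue`; check the header of your own output file, since the column name is an assumption here. The function name and the toy rows are invented for illustration.

```python
import csv
import io


def rows_below_threshold(handle, threshold, column="lrt-pvalue"):
    """Return rows of a tab-delimited results table with p-value <= threshold.

    Assumption: `column` names the p-value column in the header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    kept = []
    for row in reader:
        try:
            pvalue = float(row[column])
        except ValueError:
            continue  # skip rows where the p-value is empty or not numeric
        if pvalue <= threshold:
            kept.append(row)
    return kept


# Toy table for illustration only; variant names and values are invented.
toy_output = (
    "variant\taf\tlrt-pvalue\n"
    "var_a\t0.20\t0.0004\n"
    "var_b\t0.10\t0.31\n"
    "var_c\t0.35\tNA\n"
)

hits = rows_below_threshold(io.StringIO(toy_output), threshold=0.001)
print([row["variant"] for row in hits])
```

For a real run, pass `open("pyseer.assoc-snp2")` in place of the `io.StringIO` object and choose a threshold suited to the number of variants tested.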

michhulin commented 5 months ago

Hi John,

Great, thank you. Removing the filter allowed it to work.

Many thanks,
Michelle