ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

No variable sites #81

Closed avanwallendael closed 9 months ago

avanwallendael commented 1 year ago

Hi @ksamuk! I keep getting the "ERROR: the provided VCF appears to contain no variable sites. It may have been filtered incorrectly, or otherwise corrupted." on a vcf that I am certain has variable sites and does not appear corrupted. For example when I run bcftools view --type snps on the vcf, the output will run in pixy with the --bypass_invariant_check flag.

I have filtered out multiallelic snps and indels with vcftools - is there another necessary filtering step I'm missing?

Other info in case it's relevant: this is WGS data for 266 individuals. There is large genetic diversity and variation in sequencing depth. I used GATK to call snps and generate the invariant sites VCF. I'm running pixy on each chromosome individually.

Thank you!

ksamuk commented 1 year ago

Hey Acer! I am just in the field but I will have a look into this when I get back. Can you email me the first 1000 or so lines of your vcf and I can have a look?

avanwallendael commented 1 year ago

Thanks Kieran! Actually I'm pretty sure I figured it out. If I'm reading this right, the variable sites check only looks at the first 10k lines, and the chromosomes that failed for me only had variants after that. Upping line 618 to check head -n 20000 or more would have worked for me.

(base) gunzip -c interval_0003_GWHAAEZ00000004.1_inv._filter.vcf.gz | grep -v '#' | head -n 10000 | awk '{print $5}' | sort | uniq . (base) gunzip -c interval_0003_GWHAAEZ00000004.1_inv._filter.vcf.gz | grep -v '#' | head -n 20000 | awk '{print $5}' | sort | uniq . A A,G C *,C G T T,C

ksamuk commented 12 months ago

Thanks, Acer, we'll have a look at this for the next update.