ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License

Pixy does not detect variable sites #75

Closed sPuechmailleGitHub closed 1 year ago

sPuechmailleGitHub commented 1 year ago

I have created a VCF file with invariant sites using GATK, yet I encounter an error when running pixy on this file with the following command:

pixy --stats pi --vcf zzGdx4.GT.g.SNP.All.F1.Excl.vcf.gz --populations Gdtest_sampleIDs_popfile.txt --window_size

which gives the following error:

[pixy] pixy 1.2.7.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...Exception: [pixy] ERROR: the provided VCF appears to contain no variable sites. It may have been filtered incorrectly, or otherwise corrupted.

Surprisingly, this file does contain variable sites, though they are some distance from the start (ca. 16,000 bp). When I remove the first 16,000 invariant sites (mostly sites without data) from the vcf.gz file, it works. Could it be that pixy checks the 'validity' of the input data using only the first few thousand positions rather than the full data set (for speed reasons)? Anyway, I just wanted to report this issue in case others encounter it. Is there another way to get around the error besides pruning these first invariant sites?

Thanks in advance

ksamuk commented 1 year ago

Hi there! Yes, pixy does only check for invariant sites in the first 10 000 or so sites to avoid parsing the whole VCF. Now that we use tabix, we could probably change this to a random 10 000 sites or something along those lines.
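For readers who want to see why a long invariant stretch at the start of a file trips this kind of check, here is a minimal Python sketch of a "first N sites" scan. It is not pixy's actual implementation; the function names, the 10,000 default, and the rule that a record with ALT equal to "." counts as invariant are all assumptions made for illustration.

```python
import gzip
from itertools import islice


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode (hypothetical helper)."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def count_site_types(lines, max_sites=10_000):
    """Count variant and invariant records among the first max_sites data lines.

    Assumption: a record whose ALT column is "." (or empty) is invariant;
    anything else is treated as variant.
    """
    # Skip header lines (starting with "#") and blank lines.
    records = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    counts = {"variant": 0, "invariant": 0}
    for line in islice(records, max_sites):
        alt = line.split("\t")[4].strip()
        if alt in (".", ""):
            counts["invariant"] += 1
        else:
            counts["variant"] += 1
    return counts
```

Calling `count_site_types(open_vcf("my.vcf.gz"))` on a file whose first 16,000 records are all invariant would report zero variant sites, even if variants appear later, which matches the behaviour described in this thread.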

If you are (very) sure that your file contains both invariant and variant sites, you can pass the flag '--bypass_invariant_check yes' along with your other arguments to skip the check. This is only recommended if you are confident on that point.

Out of curiosity, does your population/organism of study have very low genetic diversity?

sPuechmailleGitHub commented 1 year ago

Thanks for this super quick answer. It all makes sense. Bypassing the invariant check indeed removes the error about variable sites! The many apparently 'invariant' sites at the start arise because the first contig begins with a repetitive element, which I filtered out by being quite stringent on mapping scores. I have now removed positions with missing data in all individuals. All works fine now. Thanks.
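The thread does not say which tool was used to remove positions with missing data in all individuals. As one possible approach, the Python sketch below drops any record in which every sample's GT field is missing. The function names are hypothetical, and the code assumes a well-formed, tab-delimited VCF with a FORMAT column and at least one sample column.

```python
def all_genotypes_missing(record_line):
    """Return True if every sample in this VCF record has a missing genotype."""
    fields = record_line.rstrip("\n").split("\t")
    # Assumption: records without FORMAT and sample columns are left alone.
    if len(fields) < 10:
        return False
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return False
    gt_index = format_keys.index("GT")
    for sample in fields[9:]:
        parts = sample.split(":")
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        # Treat phased ("|") and unphased ("/") separators the same way.
        alleles = genotype.replace("|", "/").split("/")
        if any(allele != "." for allele in alleles):
            return False
    return True


def drop_all_missing_sites(lines):
    """Yield header lines and every record with at least one called genotype."""
    for line in lines:
        if line.startswith("#") or not all_genotypes_missing(line):
            yield line
```

The generator can be fed an open file handle and its output written to a new file, which would then need to be compressed with bgzip and indexed with tabix before being passed to pixy.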