Per marker stats - Githubissues

ricebrian commented 2 years ago

Hi Pixy creators,

Thank you for the excellent population genetics tool. I have been successful at running pixy and getting estimates of Pi for a window size of 10000. My understanding is that, for a window, the reported Pi is the average calculated over all variants in that window. I am able to run this analysis with ~3000 individuals and 100k SNP markers, which takes about 20 minutes on my MacBook Pro with 16 GB of RAM running on 8 cores. I am interested in getting the estimate of Pi for each individual marker and computing my own averages for varying window sizes. I ran the following command, and the analysis is taking forever (> 1 day and still running). Am I doing something wrong, or is what I am asking too intense for pixy? Thanks in advance for your help.

    pixy --stats pi \
      --vcf test_filtered.vcf.gz \
      --populations 2698taxaandPops.txt \
      --window_size 1 \
      --n_cores 8 \
      --output_prefix pixy_outputChr1

ksamuk commented 2 years ago

Hi there,

We envisioned the single-site mode being used for 'zooming in' on specific regions, not for whole-genome output. It was actually added by user request and was not part of our original design. The current algorithm is optimized to work on windows, and the single-site computations end up having a lot of overhead. If you want to use pixy, it would be much faster to run a few commands with different window sizes than to use the single-site mode.
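As a rough illustration of the "few commands with different window sizes" suggestion, the sketch below builds one pixy command per window size, reusing only the flags from the command in the original question. The function name `pixy_commands`, the window sizes 1000 and 50000, and the `_w<size>` output prefix scheme are assumptions made for the example, not anything pixy prescribes.

```python
import shlex


def pixy_commands(window_sizes, vcf, populations, n_cores=8, prefix="pixy_output"):
    """Build one pixy argument list per window size (nothing is executed here)."""
    commands = []
    for size in window_sizes:
        commands.append([
            "pixy",
            "--stats", "pi",
            "--vcf", vcf,
            "--populations", populations,
            "--window_size", str(size),
            "--n_cores", str(n_cores),
            # Hypothetical naming scheme so each run writes to its own files.
            "--output_prefix", f"{prefix}_w{size}",
        ])
    return commands


# Window sizes other than 10000 are illustrative assumptions.
cmds = pixy_commands([1000, 10000, 50000], "test_filtered.vcf.gz", "2698taxaandPops.txt")
for cmd in cmds:
    # Print for review; subprocess.run(cmd, check=True) would execute each one.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it easy to inspect them before running anything.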

If you do need fast single-site estimates, I'd recommend looking into scikit-allel, which is what pixy uses under the hood (mostly for its nice data structures for genotypes/VCF input). Just be careful about tracking invariant and variant sites. It's a Python library rather than a command-line program, so you'll need to do some scripting to get it to work for your application.
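To make the point about tracking invariant sites concrete, here is a minimal pure-Python sketch of per-site pi that keeps the number of pairwise differences and the number of pairwise comparisons separately, then aggregates them into windows as a ratio of sums. It is not pixy's or scikit-allel's code: the function names, the input layout (one allele call per sampled chromosome, `None` for missing) and the toy data are all assumptions for illustration. For real data, scikit-allel offers vectorized routines on allele counts (for example `allel.mean_pairwise_difference`).

```python
def site_pi_terms(alleles):
    """Return (differences, comparisons) for one site.

    alleles: one call per sampled chromosome, e.g. [0, 0, 1, None];
    None marks a missing call and is excluded from both counts.
    """
    called = [a for a in alleles if a is not None]
    n = len(called)
    comparisons = n * (n - 1) // 2
    counts = {}
    for a in called:
        counts[a] = counts.get(a, 0) + 1
    # Pairs carrying the same allele contribute no difference.
    same = sum(c * (c - 1) // 2 for c in counts.values())
    return comparisons - same, comparisons


def windowed_pi(sites, window_size):
    """Aggregate per-site terms into windows keyed by 1-based window start.

    sites: iterable of (position, alleles), including invariant sites.
    """
    totals = {}
    for pos, alleles in sites:
        start = (pos - 1) // window_size * window_size + 1
        diffs, comps = site_pi_terms(alleles)
        d, c = totals.get(start, (0, 0))
        totals[start] = (d + diffs, c + comps)
    # Ratio of sums, so sites with more called data carry more weight.
    return {s: (d / c if c else float("nan")) for s, (d, c) in totals.items()}


# Toy data (hypothetical): position 2 is invariant, position 3 has a missing call.
sites = [
    (1, [0, 0, 1, 1]),
    (2, [0, 0, 0, 0]),
    (3, [0, 1, None, 0]),
    (11, [1, 1, 1, 0]),
]
pi_by_window = windowed_pi(sites, window_size=10)
```

Dropping the invariant site at position 2 would remove its comparisons from the denominator and inflate the window estimate, which is why both site types need to be tracked. Storing the two counts per site also lets you re-window at any size without recomputing from genotypes.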

All the best,

Kieran

ricebrian commented 2 years ago

Thank you for the reply and explanation.

ksamuk / pixy

Per marker stats #57