ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License

What has been calculated in the pi output if there is only one individual? #66

Closed by Jolleboll 1 year ago

Jolleboll commented 1 year ago

Hello. I'm running pixy with two populations, "case" and "control". Sometimes I have only one control sample, so I'm confused about how to interpret the pi output for this single individual. I see many stretches of 0 nucleotide diversity, which makes intuitive sense... but shouldn't it be 0 EVERYWHERE if there is only one individual? Or is it perhaps the case that heterozygosity evaluates to non-zero diversity?

Sorry for my probably dumb question and thank you for this excellent software!

ksamuk commented 1 year ago

Hi there, not a dumb question at all! Your intuition is correct: where there is a single individual, the pi estimate is effectively equivalent to individual-level heterozygosity.
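To make the equivalence concrete, here is a minimal sketch in plain Python. It is not pixy's actual code: the function name `window_pi` and the toy genotypes are made up for illustration, and it assumes pi is taken as total pairwise allele differences divided by total pairwise comparisons, summed over all genotyped sites in a window. With one diploid individual there is exactly one comparison per site (its two alleles), so the result reduces to the fraction of heterozygous sites.

```python
from itertools import combinations


def window_pi(sites):
    """Hypothetical helper: pi for one window.

    `sites` is a list of sites; each site is the list of allele calls
    pooled over every sample in the population (None = missing call).
    Returns (sum of pairwise differences) / (sum of pairwise comparisons).
    """
    diffs = 0
    comps = 0
    for alleles in sites:
        called = [a for a in alleles if a is not None]
        for a, b in combinations(called, 2):
            comps += 1
            diffs += int(a != b)
    return diffs / comps if comps else float("nan")


# One diploid individual, five genotyped sites (toy data):
# 0/0, 0/1, 1/1, 0/0, 0/1
single = [[0, 0], [0, 1], [1, 1], [0, 0], [0, 1]]

pi_single = window_pi(single)

# Individual heterozygosity: heterozygous sites / genotyped sites.
heterozygosity = sum(a != b for a, b in single) / len(single)

print(pi_single, heterozygosity)  # both are 2/5
```

Invariant (homozygous) sites still count in the denominator, which is why windows without any heterozygous calls come out as exactly 0.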

Jolleboll commented 1 year ago

Most excellent big thanks! Have a merry weekend :-]

Jolleboll commented 1 year ago

Uuuuhmmm I just realised this confuses my plans. I was hoping to use pixy to find IBDs, stretches of DNA where some samples are exactly the same (without considering phasing). I'm now selecting only sites (windows ~1 Mbp) with a pi of 0.0. But, this means I'm ignoring all heterozygous sites, does it not? How can I work around this?

Thank you for your patience!

ksamuk commented 1 year ago

Sorry for the late reply here. Hopefully you've found a workable solution. If you are computing pairwise pi between two samples, then regions with pi = 0.0 will be identical between them. However, even if two samples were 100% IBD in that region, pi is rather unlikely to be exactly zero over a large stretch, simply because of sequencing errors and the like. For IBD detection specifically, I don't think invariant sites actually affect anything, so perhaps a different tool would work better.
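One way to picture the point about sequencing errors is to compare unphased genotypes directly and accept a small mismatch rate instead of demanding exactly zero. The sketch below is only an illustration under stated assumptions: it is not part of pixy and not a validated IBD method, the function name `genotype_mismatch_rate` is hypothetical, and the `TOLERANCE` value is arbitrary and sized for the toy data rather than for real windows.

```python
def genotype_mismatch_rate(sample_a, sample_b):
    """Hypothetical helper: fraction of sites with different genotypes.

    Each sample is a list of diploid genotypes as (allele, allele)
    tuples, or None where the call is missing. Genotypes are compared
    unphased, so (0, 1) and (1, 0) count as the same genotype.
    """
    compared = 0
    mismatched = 0
    for ga, gb in zip(sample_a, sample_b):
        if ga is None or gb is None:
            continue  # skip sites missing in either sample
        compared += 1
        if sorted(ga) != sorted(gb):
            mismatched += 1
    return mismatched / compared if compared else float("nan")


# Toy window of six sites for two samples.
a = [(0, 0), (0, 1), (1, 1), (0, 0), None, (1, 0)]
b = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]

rate = genotype_mismatch_rate(a, b)

# Assumed tolerance: allow a few discordant calls (e.g. genotyping
# errors) rather than requiring an exact match across the window.
TOLERANCE = 0.25
is_candidate = rate <= TOLERANCE

print(rate, is_candidate)
```

Unlike pi within a two-sample population, this comparison does not penalize sites where both samples carry the same heterozygous genotype, so shared heterozygous sites are not discarded.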