simonhmartin / genomics_general

General tools for genomic analyses.
342 stars 93 forks source link

Pi calculation - formula for haploid data #119

Open OKersten opened 3 months ago

OKersten commented 3 months ago

Hi Simon,

first of all thanks for the awesome tool(s). Using them a lot.

Anyhow, I have a "strange" population in my dataset with elevated Pi values. In order for me to figure out what is going on, I'd like to "see under the hood" meaning looking at windows or positions in the raw data that cause these increased diversity values.

Long story short, I tried to calculate the pi values myself for some windows. I have (pseudo)haploid data and also use invariant sites. Running it via popgenWindows.py -f haplo -g ${1}_LG${ChromoNo}.rename.haplo.gz -o ${1}_LG${ChromoNo}_nucdiv.csv --popsFile ${1}_Populations.txt -p ${1} -w 500000 -T 10 --roundTo 7 -m 500

Afterwards I am just looking the .haplo.gz and filter it down to a window of question, do my calculations (pi for each variant site, summing all pi values and then dividing by the total # of sites in that window) and compare it with the popgenWindows output.

Regarding the calculations, I found the following in the genomics.py code for the pi per-site: def baseCountPi(baseCounts): N = sum(baseCounts) return (baseCounts[0]*baseCounts[1] + baseCounts[0]*baseCounts[2] + baseCounts[0]*baseCounts[3] + baseCounts[1]*baseCounts[2] + baseCounts[1]*baseCounts[3] + baseCounts[2]*baseCounts[3]) / (.5*N*(N-1)) I assume this also works for haploid data.

Did I assume correctly that the window based Pi then is simply the sum of all per-site pi's divided by the number of sites in the window?

In that case I see some discrepancies between my calculations and the popgenWindows output ....

Thanks for your help!

-Oliver

simonhmartin commented 1 month ago

Hi Oliver, Apologies for the slow response. There are two ways to calculate pi with popgenWindows.py. The default is different from the one you mention above. The main difference is how they handle missing data.

The default method (also accessible with --analysis popDist) averages pairwise distance between all pairs of haplotypes for the full length of the window, excluding sites at which one or the other haplotype in the pair have missing data. By excluding missing data on a pairwise basis, we maximise the use of available data.

The alternative method is implemented if --analysis popFreq is set. This uses allele frequencies at each site according to the formula you indicated above. However, it excludes any site with missing data in even a single individual. This is necessary for the Watterson's theta to make sense. This approach is therefore not maximising use of the data.

If there is no missing data in any individual, the two methods should give the same result, and it should indeed reflect the per-site pi averaged over all sites in the window. If you have no missing data in your geno file yet you are seeing discrepancies, do let me know. It might be necessary to dig in and find out what is causing the discrepancy with some dummy datasets.

Best wishes, Simon