ANGSD/peñalba calc.dxy post-processing script

gmkov commented 4 years ago

Hi, Im working with genotype likelihoods, and would be very handy to see your custom script to process ANGSD/ calc.dxy output mentioned in L397 of the preprint, sorry for the slightly unrelated request.

Many thanks (and great package/prepint!) Gabby

kkorunes commented 4 years ago

Hi Gabby, Thanks for checking out pixy! We'll get these scripts posted properly by the time the manuscript is published, but in the meantime, I'm attaching the script we used plus a few notes on the calcdxy pipeline.

Finalstep_process_angsd_calcdxy.zip

Our script is just a simple perl script that takes as input your per-site dxy file from calc.dxy and your desired window size. Note that line 10 of the attached script has the chromosome length (used to determine how many windows there are) hard-coded, so this will need editing according to your data.

The output of this script includes an average Dxy per window using the total # of sites as the denominator, and an average using the # of sites with data as the denominator. The first of these (the column "AvgOverAllSites") was more in line with our expectations and other estimates of divergence.

Best wishes, Katharine

gmkov commented 4 years ago

Dear Katharine,

Thank you very much for the speedy reply. I was curious about the totlength bit, and you are totally right when noting that the reproducibility and accuracy of the script depends strongly on estimating the total number of sites for the denominator (should be both variant and invariant, so yes closer to total # of sites). I've seen papers reaching an approximation of the number of total sites angsd might be using by estimating read depth at all sites (but with quality/mapping filters) and excluding those sites with low or too high coverage (Delmore et al. 2018, https://onlinelibrary.wiley.com/doi/full/10.1002/evl3.46 )- may I ask how you obtained your totlength input? I wonder if this might drive some of the overestimations in dxy.

We are thinking of circumventing these issues with the dxy denominator by using the joint theta pairwise of a population pair and subtracting the pop-level theta pairwise from it, and then normalising with the joint number of sites that angsd nicely outputs for thetas. Again, thanks for the scripts and for discussing angsd on your ms. Best, Gabby

ksamuk / pixy

ANGSD/peñalba calc.dxy post-processing script #7