Closed James-S-Santangelo closed 2 years ago
Thanks James.
This was written for a very specific purpose, and it's not really release-ready software, so please bear that in mind!
Any modifications you have made to fix the GDIST issue would be gratefully received as a PR.
I think this can be explained by the fact that your windows are very large and your data seem SNP-dense, so in many cases you will be selecting 200 SNPs from a larger number. This is likely to be fairly stochastic. I would consider decreasing your window size or increasing your number of selected SNPs (depending on your recombination rate, to give appropriate resolution). If you look, the windows with <200 SNPs contain the same LLs in both runs.
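To make the subsampling point concrete, here is a minimal Python sketch. It is not the xpclr code: the function name `subsample_window`, the cap of 200, and the seed handling are assumptions for illustration only. It shows that a window at or below the cap always uses the same SNPs, while a denser window uses a random subset that can change from run to run unless the seed is fixed.

```python
import random

MAX_SNPS = 200  # assumed cap on SNPs used per window (illustrative only)

def subsample_window(positions, max_snps=MAX_SNPS, seed=None):
    """Return the SNP positions used for one window.

    Windows at or below the cap are returned unchanged; larger windows
    are randomly thinned, so the result depends on the random seed.
    """
    if len(positions) <= max_snps:
        return sorted(positions)
    rng = random.Random(seed)
    return sorted(rng.sample(positions, max_snps))

# Toy positions: a sparse window (150 SNPs) and a dense window (2,000 SNPs).
sparse = list(range(1, 50_000, 334))[:150]
dense = list(range(1, 50_000, 25))

# Sparse window: identical across runs, whatever the seed.
print(subsample_window(sparse, seed=1) == subsample_window(sparse, seed=2))

# Dense window: different seeds pick different SNP subsets ...
print(subsample_window(dense, seed=1) == subsample_window(dense, seed=2))

# ... while a fixed seed makes the selection repeatable.
print(subsample_window(dense, seed=42) == subsample_window(dense, seed=42))
```

If the tool behaves like this sketch, any statistic computed on the dense windows would be expected to vary between runs, which fits the observation that only the windows with <200 SNPs agree exactly.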
I can't give you a clear answer, but I would guess it's some inconsistency with the genetic distance calculations. I'm afraid I don't have the capacity to attempt to exhaustively debug this at the moment.
I haven't thought about the popgen implications of this for a long time, but there is some parameter omega that is a measure of how diverged your 2 populations are. If this number is large (or perhaps small?) this could explain the small numbers of instances where divergence is greater than the baseline.
If you are confident in your phasing, I would suggest an alternative metric to infer selection. In our analyses with Anopheles we found XPEHH, iHS, H12 (requiring phased data), PBS, and even Fst to be better at inferring recent selection than XPCLR.
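As a rough illustration of the simplest of these alternatives, the sketch below computes Fst for one window from per-SNP allele counts using Hudson's estimator (ratio of averages). This is only one possible Fst estimator and not necessarily the one used in the Anopheles analyses; the function name `hudson_fst` and the input layout are assumptions made for this example.

```python
def hudson_fst(ac1, ac2):
    """Hudson's Fst for one window, as a ratio of averages over SNPs.

    ac1, ac2: per-SNP (alt_allele_count, n_chromosomes) tuples for
    population 1 and population 2, in the same SNP order.
    """
    num_sum = 0.0
    den_sum = 0.0
    for (a1, n1), (a2, n2) in zip(ac1, ac2):
        p1, p2 = a1 / n1, a2 / n2
        # Between-population divergence, corrected for sampling variance.
        num = ((p1 - p2) ** 2
               - p1 * (1 - p1) / (n1 - 1)
               - p2 * (1 - p2) / (n2 - 1))
        # Probability that two alleles, one from each population, differ.
        den = p1 * (1 - p2) + p2 * (1 - p1)
        num_sum += num
        den_sum += den
    return num_sum / den_sum if den_sum > 0 else float("nan")

# 41 diploid individuals per population gives 82 chromosomes each.
N = 82
fixed_difference = hudson_fst([(N, N)], [(0, N)])
same_frequency = hudson_fst([(41, N)], [(41, N)])
print(fixed_difference, same_frequency)
```

A fixed difference gives a value of 1, and identical allele frequencies give a value slightly below zero because of the sampling correction.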
Good luck!
Hey Nick,
Sorry for being slow getting back to this.
Thanks for your help! I'll close this for now.
Hey Nick,
Thanks a bunch for writing this improved wrapper around the XP-CLR algorithm! I've been experimenting with it the past couple of days and have a couple of questions I was hoping you could help with. Some of this might be the expected behaviour, but I just want to double-check and make sure everything is running smoothly.
For context, I'm testing on a single chromosome with ~460K phased SNPs genotyped across two populations, each with 41 individuals. I'm using 50 kb non-overlapping windows and have modified the code just a bit to ensure the genetic map information gets incorporated when loading the VCF (see discussion #71). This is the command I ran:
xpclr --out ./test_xpclr_python1 --format vcf --input phased.vcf.gz --samplesA pop1.samples --samplesB pop2.samples --phased --chr CM019101.1 --size 50000 --step 50000 --gdistkey CM
And an abbreviated part of the output showing the windows with non-zero XP-CLR scores
I ran the same command a second time, and here are the non-zero XP-CLR scores
As you can see, the three windows in the second run are present in the first, but the first contains two windows (starting 35700001 and 20550001) not present in the second. Is this behaviour expected?
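For anyone wanting to check this kind of run-to-run difference systematically, here is a small Python sketch that lists which windows have non-zero scores in each run. The column names `start` and `xpclr` and the tab-delimited layout are assumptions and should be adjusted to the actual output header; the toy tables are made up and are not the results from this issue.

```python
import csv
import io

def nonzero_windows(handle, start_col="start", score_col="xpclr"):
    """Map window start -> score for windows with a non-zero, non-missing score."""
    out = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row[score_col]
        if raw in ("", "nan", "NaN", "NA"):
            continue  # skip windows where no score was computed
        score = float(raw)
        if score != 0.0:
            out[int(row[start_col])] = score
    return out

def compare_runs(run1, run2):
    """Return (starts only in run1, starts only in run2, shared starts)."""
    a, b = set(run1), set(run2)
    return sorted(a - b), sorted(b - a), sorted(a & b)

# Toy tables standing in for two output files (values are made up).
RUN1 = "start\txpclr\n1\t0.0\n50001\t1.5\n100001\t2.0\n150001\t0.7\n"
RUN2 = "start\txpclr\n1\t0.0\n50001\t1.5\n100001\t0.0\n150001\t0.7\n"

r1 = nonzero_windows(io.StringIO(RUN1))
r2 = nonzero_windows(io.StringIO(RUN2))
only1, only2, shared = compare_runs(r1, r2)
print("only in run 1:", only1)
print("only in run 2:", only2)
print("in both runs:", shared)
```

With real files, pass `open(path, newline="")` in place of the `io.StringIO` objects.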
txt input format and the results were quite different (shown below). Any idea why the results would be so divergent depending on the input method?

Thanks a bunch for your help! I'm happy to send any additional details or example data files, if needed.
James