Inconsistent results across replicate runs and different input formats

James-S-Santangelo commented 2 years ago

Hey Nick,

Thanks a bunch for writing for this improved wrapper around the XP-CLR algorithm! I've been experimenting with it the past couple days and have a couple questions I was hoping you could help with. Some of this might be the expected behaviour, but I just want to double-check and make sure everything is running smoothly.

For context, I'm testing on a single chromosome with ~460K phased SNPs genotyped across two populations, each with 41 individuals. I'm using 50Kb non-overlapping windows and have modified the code just a bit to ensure the genetic map information gets incorporated when loading the VCF (see discussion #71 )

I've noticed different results when running the algorithm multiple times, despite the commands being identical. Here is an example command:

xpclr --out ./test_xpclr_python1 --format vcf --input phased.vcf.gz --samplesA pop1.samples --samplesB pop2.samples --phased --chr CM019101.1 --size 50000 --step 50000 --gdistkey CM

And an abbreviated part of the output showing the windows with non-zero XP-CLR scores

CM019101.1_35700001_35750000    CM019101.1      35700001        35750000        35713697        35749403        -470.2640833357263      -470.8315924629069      5e-05   200.0   318.0   1.1350182543611709      6.645599495710863
CM019101.1_50450001_50500000    CM019101.1      50450001        50500000        50450402        50496781        -103.82130597451395     -104.73759751146754     1e-05   39.0    39.0    1.8325830739071876      10.757065720666866
CM019101.1_03650001_03700000    CM019101.1      3650001 3700000 3650002 3676290 -548.0216995457279      -549.1301298461759      0.0002  200.0   213.0   2.216860600896098       13.022008029903116
CM019101.1_20550001_20600000    CM019101.1      20550001        20600000        20551826        20599898        -534.3650886383808      -535.8522593213173      0.0008  200.0   202.0   2.974341365872988       17.48662048453232
CM019101.1_06450001_06500000    CM019101.1      6450001 6500000 6450040 6496642 -371.30774352936294     -374.3084489448928      0.001   142.0   142.0   6.001410831059729       35.3282511373091

I ran the same command a second time, and here are the non-zero XP-CLR scores

CM019101.1_03650001_03700000    CM019101.1      3650001 3700000 3650002 3676290 -548.2097300523996      -548.459025779834       0.0001  200.0   213.0   0.4985914548688015      3.411854594258649
CM019101.1_50450001_50500000    CM019101.1      50450001        50500000        50450402        50496781        -103.82130597451395     -104.73759751146754     1e-05   39.0    39.0    1.8325830739071876      12.621910262940947
CM019101.1_06450001_06500000    CM019101.1      6450001 6500000 6450040 6496642 -371.30774352936294     -374.3084489448928      0.001   142.0   142.0   6.001410831059729       41.40405189603471

As you can see, the three windows in the second run are present in the first, but the first contains two windows (starting 35700001 and 20550001) not present in the second. Is this behaviour expected?

I also ran XP-CLR using the txt input format and the results were quite different (show below). Any idea why the results would be so divergent depending on the input method?

CM019101.1_11200001_11250000    CM019101.1      11200001        11250000        11207870        11249547        -198.2870189443366      -198.35819217512977     5e-05   76.0    76.0    0.14234646158632813     0.4542605483024633
CM019101.1_35700001_35750000    CM019101.1      35700001        35750000        35713587        35749403        -468.99464880149543     -469.25242483690613     5e-05   200.0   318.0   0.5155520708213999      1.7731704650171318
CM019101.1_13700001_13750000    CM019101.1      13700001        13750000        13700729        13749999        -295.84513128541914     -296.16898894794815     1e-05   117.0   117.0   0.6477153250580159      2.2402358636193296
CM019101.1_88300001_88350000    CM019101.1      88300001        88350000        88306076        88349812        -416.99007356317145     -417.4249544530419      1e-05   161.0   161.0   0.8697617797408839      3.024948804169226
CM019101.1_42100001_42150000    CM019101.1      42100001        42150000        42101976        42149248        -507.1636681512085      -507.6340447742454      5e-05   200.0   257.0   0.9407532460737684      3.275832892274079
CM019101.1_31250001_31300000    CM019101.1      31250001        31300000        31271718        31298398        -205.84814679407478     -206.36394322002735     0.0001  89.0    89.0    1.0315928519051454      3.596860374168088
CM019101.1_81500001_81550000    CM019101.1      81500001        81550000        81500540        81541804        -540.4412413455236      -541.0964098824936      5e-05   200.0   214.0   1.3103370739399907      4.581943421076709
CM019101.1_67950001_68000000    CM019101.1      67950001        68000000        67951608        67954600        -87.38156494325017      -88.10710958402679      5e-05   33.0    33.0    1.4510892815532372      5.0793622106846215
CM019101.1_93550001_93600000    CM019101.1      93550001        93600000        93553686        93599764        -489.40872168301763     -490.2154378137789      0.0002  200.0   202.0   1.6134322615225756      5.6530828646410995
CM019101.1_32800001_32850000    CM019101.1      32800001        32850000        32807008        32849553        -485.0302107983075      -486.1453068056624      0.0001  200.0   345.0   2.2301920147098144      7.832713970337506
CM019101.1_04200001_04250000    CM019101.1      4200001 4250000 4200779 4238997 -488.7639945800862      -491.02409467622255     0.0008  200.0   202.0   4.520200192272682       15.92561080791676
CM019101.1_14450001_14500000    CM019101.1      14450001        14500000        14453318        14499974        -475.60338762542364     -480.99336508276906     0.001   200.0   239.0   10.779954914690848      38.0476056957228

Finally, I'm curious why there are so many windows with XP-CLR scores of zero. This is related to #50 but doesn't seem to have been resolve. For context, of the 1918 windows in the first run above, only the five I've shown had non-zero XP-CLR scores. Here is a brief look at some of the other windows.

CM019101.1_00000001_00050000    CM019101.1      1       50000   17012   47853   -26.96386015268241      -26.96386015268241      0.0     10.0    10.0    0.0     -0.044229301205165925
CM019101.1_00050001_00100000    CM019101.1      50001   100000  60383   99989   -64.56505946420289      -64.56505946420289      0.0     25.0    25.0    0.0     -0.044229301205165925
CM019101.1_00100001_00150000    CM019101.1      100001  150000  100041  149283  -240.97337448394637     -240.97337448394637     0.0     112.0   112.0   0.0     -0.044229301205165925
CM019101.1_00150001_00200000    CM019101.1      150001  200000  150819  192845  -493.49435357204254     -493.49435357204254     0.0     200.0   324.0   0.0     -0.044229301205165925
CM019101.1_00200001_00250000    CM019101.1      200001  250000  200311  235442  -240.31468122967547     -240.31468122967547     0.0     88.0    88.0    0.0     -0.044229301205165925
CM019101.1_00250001_00300000    CM019101.1      250001  300000  253011  299674  -92.94280883384273      -92.94280883384273      0.0     34.0    34.0    0.0     -0.044229301205165925
CM019101.1_00300001_00350000    CM019101.1      300001  350000  300117  349634  -131.6561015250845      -131.6561015250845      0.0     50.0    50.0    0.0     -0.044229301205165925
CM019101.1_00350001_00400000    CM019101.1      350001  400000  352262  399647  -497.6258897388341      -497.6258897388341      0.0     200.0   251.0   0.0     -0.044229301205165925
CM019101.1_00400001_00450000    CM019101.1      400001  450000  412394  442160  -218.45130399355043     -218.45130399355043     0.0     84.0    84.0    0.0     -0.044229301205165925
CM019101.1_00450001_00500000    CM019101.1      450001  500000  452519  498635  -423.619357452077       -423.619357452077       0.0     166.0   166.0   0.0     -0.044229301205165925
CM019101.1_00500001_00550000    CM019101.1      500001  550000  501205  522718  -339.7795617829173      -339.7795617829173      0.0     133.0   133.0   0.0     -0.044229301205165925
CM019101.1_00550001_00600000    CM019101.1      550001  600000  584801  585767  -56.83623006837608      -56.83623006837608      0.0     20.0    20.0    0.0     -0.044229301205165925
CM019101.1_00600001_00650000    CM019101.1      600001  650000  600555  647849  -497.07869055206737     -497.07869055206737     0.0     188.0   188.0   0.0     -0.044229301205165925
CM019101.1_00650001_00700000    CM019101.1      650001  700000  654814  696253  -344.83975058224416     -344.83975058224416     0.0     145.0   145.0   0.0     -0.044229301205165925
CM019101.1_00700001_00750000    CM019101.1      700001  750000  700378  746973  -497.1095964578172      -497.1095964578172      0.0     200.0   283.0   0.0     -0.044229301205165925
CM019101.1_00750001_00800000    CM019101.1      750001  800000  750308  797140  -141.88018162404975     -141.88018162404975     0.0     61.0    61.0    0.0     -0.044229301205165925
CM019101.1_00800001_00850000    CM019101.1      800001  850000  813196  832478  -116.82126656148225     -116.82126656148225     0.0     47.0    47.0    0.0     -0.044229301205165925
CM019101.1_00850001_00900000    CM019101.1      850001  900000  851386  898710  -85.88725249340622      -85.88725249340622      0.0     32.0    32.0    0.0     -0.044229301205165925
CM019101.1_00900001_00950000    CM019101.1      900001  950000  901434  943786  -385.31800206301125     -385.31800206301125     0.0     141.0   141.0   0.0     -0.044229301205165925
CM019101.1_00950001_01000000    CM019101.1      950001  1000000 960152  998083  -254.7240180980443      -254.7240180980443      0.0     95.0    95.0    0.0     -0.044229301205165925
CM019101.1_01000001_01050000    CM019101.1      1000001 1050000 0       0                               4.0     4.0             
CM019101.1_01050001_01100000    CM019101.1      1050001 1100000 1067035 1096817 -209.9168784334494      -209.9168784334494      0.0     79.0    79.0    0.0     -0.044229301205165925
CM019101.1_01100001_01150000    CM019101.1      1100001 1150000 1105521 1135290 -455.285814965769       -455.285814965769       0.0     200.0   217.0   0.0     -0.044229301205165925
CM019101.1_01150001_01200000    CM019101.1      1150001 1200000 1151519 1198127 -495.3443099936061      -495.3443099936061      0.0     200.0   232.0   0.0     -0.044229301205165925
CM019101.1_01200001_01250000    CM019101.1      1200001 1250000 1201972 1249725 -410.7439945399301      -410.7439945399301      0.0     200.0   224.0   0.0     -0.044229301205165925
CM019101.1_01250001_01300000    CM019101.1      1250001 1300000 1253824 1295464 -463.9418345274004      -463.9418345274004      0.0     167.0   167.0   0.0     -0.044229301205165925
CM019101.1_01300001_01350000    CM019101.1      1300001 1350000 1300069 1339991 -36.59392578150957      -36.59392578150957      0.0     13.0    13.0    0.0     -0.044229301205165925
CM019101.1_01350001_01400000    CM019101.1      1350001 1400000 1354062 1391433 -459.8762518346928      -459.8762518346928      0.0     165.0   165.0   0.0     -0.044229301205165925
CM019101.1_01400001_01450000    CM019101.1      1400001 1450000 1400817 1448726 -485.77953823339146     -485.77953823339146     0.0     200.0   290.0   0.0     -0.044229301205165925
CM019101.1_01450001_01500000    CM019101.1      1450001 1500000 1450259 1498658 -400.13252398180805     -400.13252398180805     0.0     200.0   409.0   0.0     -0.044229301205165925
CM019101.1_01500001_01550000    CM019101.1      1500001 1550000 1509565 1546355 -475.16436943254644     -475.16436943254644     0.0     200.0   672.0   0.0     -0.044229301205165925
CM019101.1_01550001_01600000    CM019101.1      1550001 1600000 1554677 1597473 -467.7271159300981      -467.7271159300981      0.0     200.0   201.0   0.0     -0.044229301205165925
CM019101.1_01600001_01650000    CM019101.1      1600001 1650000 1600025 1647280 -490.0368013784556      -490.0368013784556      0.0     192.0   192.0   0.0     -0.044229301205165925
CM019101.1_01650001_01700000    CM019101.1      1650001 1700000 1660337 1697175 -527.130733043159       -527.130733043159       0.0     194.0   194.0   0.0     -0.044229301205165925

Thanks a bunch for your help! I'm happy to send any additional details or example data files, if needed.

James

hardingnj commented 2 years ago

Thanks James.

This was written for a very specific purpose, and it's not really release ready software- so please bear that in mind!

Any modifications you have made to fix the GDIST issue would be gratefully received as a PR.
I think this can be explained by the fact your windows are very large, and your data seem SNP dense, so in many cases you will be selecting 200 SNPs from a larger number. This is likely to be fairly stochastic. I would consider decreasing your window size or increasing your number of selected SNPs (depending on your recombination rate to give appropriate resolution). If you look the windows with <200 SNPs contain the same LLs in both runs.
I can't give you a clear answer, but I would guess it's some inconsistency with the genetic distance calculations. I'm afraid I don't have the capacity to attempt to exhaustively debug this at the moment.
I haven't thought about the popgen implications of this for a long time, but there is some parameter omega that is a measure of how diverged your 2 populations are. If this number is large (or perhaps small?) this could explain the small numbers of instances where divergence is greater than the baseline.
If you are confident in your phasing- I would suggest an alternative metric to infer selection. In our analyses with Anopheles we found XPEHH, iHS, H12 (requiring phased data), PBS, and even Fst to be better at inferring recent selection than XPCLR.

Good luck!

James-S-Santangelo commented 2 years ago

Hey Nick,

Sorry for being slow getting back to this.

I'll gladly submit a PR with the changes I made sometime this week.
Got it, this makes sense, thanks!
Hmm yeah I'd have to think about it a bit more. Differentiation between these populations is pretty low (Fst ~ 0.02).
Yeah I agree. I've actually estimated Fst, XP-nSL, and iHH12 and have found some promising candidates.

Thanks for you help! I'll close this for now.

hardingnj / xpclr

Inconsistent results across replicate runs and different input formats #77