kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License
49 stars 14 forks source link

Sensitivity to mutation positions for "ms" format data #20

Closed molpopgen closed 5 years ago

molpopgen commented 5 years ago

It seems that handling "ms" format requires that mutation positions are within [0, 1), but that this is not checked. This can lead to problems when simulations allow for positions to be outside that range, which is common in forward sims.

MRE is below. I added print(newPositions) at line 54 of msTools.py to show what happens.

Here is data set 1:

/usr/local/bin/mspms 10 1 -t 10
761793949 540378133 223878765

//
segsites: 20
positions: 0.006 0.094 0.102 0.133 0.303 0.339 0.383 0.383 0.426 0.440 0.538 0.563 0.593 0.604 0.667 0.696 0.701 0.771 0.828 0.847 
10010000000100001000
01000010001001000000
01000010001001000000
00000101100000100000
10110000000000010000
01000010001011000001
00000101000000100010
01001000001001000000
10110000010000000100
00010000000000000000

Getting features from it:

/usr/bin/python3 makeFeatureVecsForSingleMs_ogSHIC.py out.ms 1100000 11 None None all 0.25 0.0 None stats
file name='out.ms'maskFileName='None': not doing any masking!
makeFeatureVecsForSingleMs_ogSHIC.py:108: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  start = time.clock()
[6600, 103400, 112200, 146300, 333300, 372900, 421300, 421301, 468600, 484000, 591800, 619299, 652300, 664400, 733700, 765600, 771100, 848100, 910800, 931700]
makeFeatureVecsForSingleMs_ogSHIC.py:209: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  time.clock()-start))
total time spent calculating summary statistics and generating feature vectors: 0.845340 secs

NOTE the range of mutation positions!!! Keep that in mind for later

The resulting stats:

pi_win0 pi_win1 pi_win2 pi_win3 pi_win4 pi_win5 pi_win6 pi_win7 pi_win8 pi_win9 pi_win10    thetaW_win0 thetaW_win1 thetaW_win2 thetaW_win3 thetaW_win4 thetaW_win5 thetaW_win6 thetaW_win7 thetaW_win8 thetaW_win9 thetaW_win10    tajD_win0   tajD_win1   tajD_win2   tajD_win3   tajD_win4   tajD_win5   tajD_win6   tajD_win7   tajD_win8   tajD_win9   tajD_win10  thetaH_win0 thetaH_win1 thetaH_win2 thetaH_win3 thetaH_win4 thetaH_win5 thetaH_win6 thetaH_win7 thetaH_win8 thetaH_win9 thetaH_win10    fayWuH_win0 fayWuH_win1 fayWuH_win2 fayWuH_win3 fayWuH_win4 fayWuH_win5 fayWuH_win6 fayWuH_win7 fayWuH_win8 fayWuH_win9 fayWuH_win10    maxFDA_win0 maxFDA_win1 maxFDA_win2 maxFDA_win3 maxFDA_win4 maxFDA_win5 maxFDA_win6 maxFDA_win7 maxFDA_win8 maxFDA_win9 maxFDA_win10    HapCount_win0   HapCount_win1   HapCount_win2   HapCount_win3   HapCount_win4   HapCount_win5   HapCount_win6   HapCount_win7   HapCount_win8HapCount_win9  HapCount_win10  H1_win0 H1_win1 H1_win2 H1_win3 H1_win4 H1_win5 H1_win6 H1_win7 H1_win8 H1_win9 H1_win10    H12_win0    H12_win1    H12_win2    H12_win3    H12_win4    H12_win5    H12_win6    H12_win7    H12_win8    H12_win9    H12_win10   H2/H1_win0  H2/H1_win1  H2/H1_win2  H2/H1_win3  H2/H1_win4  H2/H1_win5  H2/H1_win6  H2/H1_win7  H2/H1_win8  H2/H1_win9  H2/H1_win10 ZnS_win0    ZnS_win1    ZnS_win2    ZnS_win3    ZnS_win4    ZnS_win5    ZnS_win6    ZnS_win7    ZnS_win8    ZnS_win9    ZnS_win10   Omega_win0  Omega_win1  Omega_win2  Omega_win3  Omega_win4  Omega_win5  Omega_win6  Omega_win7  Omega_win8  Omega_win9  Omega_win10 distVar_win0    distVar_win1    distVar_win2    distVar_win3    distVar_win4    distVar_win5    distVar_win6    distVar_win7    distVar_win8    distVar_win9    distVar_win10   distSkew_win0   distSkew_win1   distSkew_win2   distSkew_win3   distSkew_win4   distSkew_win5   distSkew_win6   distSkew_win7   distSkew_win8   distSkew_win9   distSkew_win10  distKurt_win0   distKurt_win1   distKurt_win2   distKurt_win3   distKurt_win4   distKurt_win5   distKurt_win6   distKurt_win7   distKurt_win8   distKurt_win9   distKurt_win10
0.0719178082191781  0.21917808219178084 0.0 0.0856164383561644  0.18835616438356165 0.08219178082191783 0.1438356164383562  0.11643835616438357 0.03082191780821918 0.06164383561643836 0.0 0.05    0.15000000000000002 0.0 0.1 0.2 0.05    0.15000000000000002 0.15000000000000002 0.05    0.1 0.0 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.08333333333333334 0.33333333333333337 0.0 0.04629629629629631 0.1388888888888889  0.14814814814814817 0.16666666666666669 0.055555555555555566    0.00925925925925926 0.01851851851851852 0.0 0.10937499999999997 0.046875    0.15625 0.07812499999999997 0.0 0.12499999999999999 0.06249999999999996 0.04687499999999997 0.125   0.09374999999999999 0.15625 0.12499999999999997 0.16666666666666666 0.0 0.08333333333333333 0.12499999999999997 0.16666666666666666 0.16666666666666666 0.08333333333333333 0.041666666666666664    0.041666666666666664    0.0 0.06451612903225806 0.12903225806451613 0.03225806451612903 0.0967741935483871  0.16129032258064516 0.06451612903225806 0.12903225806451613 0.12903225806451613 0.06451612903225806 0.0967741935483871  0.03225806451612903 0.08978328173374613 0.043343653250774   0.15479876160990713 0.08359133126934984 0.043343653250774   0.0804953560371517  0.05572755417956656 0.06501547987616099 0.12693498452012386 0.10216718266253873 0.15479876160990713 0.11210762331838563 0.04932735426008969 0.11210762331838563 0.0919282511210762  0.05829596412556052 0.11210762331838563 0.07399103139013453 0.07399103139013453 0.11210762331838563 0.09192825112107622 0.11210762331838563 0.08151905190399744 0.22514785763961184 0.0 0.04864305566287911 0.22514785763961184 0.16164461574125985 0.16052208368750104 0.07504928587987068 0.0064066463555987165   0.015919545489669538    0.0 0.0 0.5568280841528694  0.0 0.04705589443545375 0.19394995642972732 0.0 0.14291049421137805 0.03834183991036973 0.0 0.02091373086020167 0.0 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.09090909090909091 0.06243805748265611 0.25074331020812685 0.0 0.08424182358771062 0.18830525272547072 0.06243805748265611 0.13825569871159565 0.10208126858275522 0.04013875123885036 0.0713577799801784  0.0 0.04857406179761379 0.05112875308219224 0.024287030898806895    0.10935834906143639 0.09252609663298561 0.0 0.10299833267553081 0.0722955867550393  0.2969082765947904  0.1776364816027976  0.024287030898806895    0.0 0.06485265967946549 0.13601903476500846 0.08731269565388747 0.10368307290853335 0.0 0.12327286878258568 0.08940117383601429 0.15317458869933398 0.10626487091016294 0.13601903476500846

Now, take the same exact data, but change mutations to be on [1, 2):

/usr/local/bin/mspms 10 1 -t 10
761793949 540378133 223878765

//
segsites: 20
positions: 1.006 1.094 1.102 1.133 1.303 1.339 1.383 1.383 1.426 1.440 1.538 1.563 1.593 1.604 1.667 1.696 1.701 1.771 1.828 1.847 
10010000000100001000
01000010001001000000
01000010001001000000
00000101100000100000
10110000000000010000
01000010001011000001
00000101000000100010
01001000001001000000
10110000010000000100
00010000000000000000

Get the stats:

python3 diploSHIC.py fvecSim haploid out2.ms stats2
/usr/bin/python3 makeFeatureVecsForSingleMs_ogSHIC.py out2.ms 1100000 11 None None all 0.25 0.0 None stats2
file name='out2.ms'maskFileName='None': not doing any masking!
makeFeatureVecsForSingleMs_ogSHIC.py:108: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  start = time.clock()
[1099981, 1099982, 1099983, 1099984, 1099985, 1099986, 1099987, 1099988, 1099989, 1099990, 1099991, 1099992, 1099993, 1099994, 1099995, 1099996, 1099997, 1099998, 1099999, 1100000]
makeFeatureVecsForSingleMs_ogSHIC.py:209: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  time.clock()-start))
total time spent calculating summary statistics and generating feature vectors: 0.112435 secs

**NOTE that the positions are all "bunched up" now on the right!!!!***

The stats are very different:

pi_win0 pi_win1 pi_win2 pi_win3 pi_win4 pi_win5 pi_win6 pi_win7 pi_win8 pi_win9 pi_win10    thetaW_win0 thetaW_win1 thetaW_win2 thetaW_win3 thetaW_win4 thetaW_win5 thetaW_win6 thetaW_win7 thetaW_win8 thetaW_win9 thetaW_win10    tajD_win0   tajD_win1   tajD_win2   tajD_win3   tajD_win4   tajD_win5   tajD_win6   tajD_win7   tajD_win8   tajD_win9   tajD_win10  thetaH_win0 thetaH_win1 thetaH_win2 thetaH_win3 thetaH_win4 thetaH_win5 thetaH_win6 thetaH_win7 thetaH_win8 thetaH_win9 thetaH_win10    fayWuH_win0 fayWuH_win1 fayWuH_win2 fayWuH_win3 fayWuH_win4 fayWuH_win5 fayWuH_win6 fayWuH_win7 fayWuH_win8 fayWuH_win9 fayWuH_win10    maxFDA_win0 maxFDA_win1 maxFDA_win2 maxFDA_win3 maxFDA_win4 maxFDA_win5 maxFDA_win6 maxFDA_win7 maxFDA_win8 maxFDA_win9 maxFDA_win10    HapCount_win0   HapCount_win1   HapCount_win2   HapCount_win3   HapCount_win4   HapCount_win5   HapCount_win6   HapCount_win7   HapCount_win8HapCount_win9  HapCount_win10  H1_win0 H1_win1 H1_win2 H1_win3 H1_win4 H1_win5 H1_win6 H1_win7 H1_win8 H1_win9 H1_win10    H12_win0    H12_win1    H12_win2    H12_win3    H12_win4    H12_win5    H12_win6    H12_win7    H12_win8    H12_win9    H12_win10   H2/H1_win0  H2/H1_win1  H2/H1_win2  H2/H1_win3  H2/H1_win4  H2/H1_win5  H2/H1_win6  H2/H1_win7  H2/H1_win8  H2/H1_win9  H2/H1_win10 ZnS_win0    ZnS_win1    ZnS_win2    ZnS_win3    ZnS_win4    ZnS_win5    ZnS_win6    ZnS_win7    ZnS_win8    ZnS_win9    ZnS_win10   Omega_win0  Omega_win1  Omega_win2  Omega_win3  Omega_win4  Omega_win5  Omega_win6  Omega_win7  Omega_win8  Omega_win9  Omega_win10 distVar_win0    distVar_win1    distVar_win2    distVar_win3    distVar_win4    distVar_win5    distVar_win6    distVar_win7    distVar_win8    distVar_win9    distVar_win10   distSkew_win0   distSkew_win1   distSkew_win2   distSkew_win3   distSkew_win4   distSkew_win5   distSkew_win6   distSkew_win7   distSkew_win8   distSkew_win9   distSkew_win10  distKurt_win0   distKurt_win1   distKurt_win2   distKurt_win3   distKurt_win4   distKurt_win5   distKurt_win6   distKurt_win7   distKurt_win8   distKurt_win9   distKurt_win10
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.05263157894736842 0.47368421052631576 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.09881422924901187 0.011857707509881426    0.09842519685039369 0.09842519685039369 0.09842519685039369 0.09842519685039369 0.09842519685039369 0.09842519685039369 0.09842519685039369 0.09842519685039369 0.09842519685039369 0.09842519685039369 0.015748031496062995    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.0 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.09999999999999998 0.0

cc @khoihuynh-thorntonlab