stschiff / msmc

Implementation of the multiple sequential markovian coalescent
GNU General Public License v3.0
87 stars 20 forks source link

Couple issues understanding decode output #22

Closed jthuff closed 6 years ago

jthuff commented 8 years ago

Hello,

Thanks Stephan for making such useful software.

I ran the decode program for tMRCAs of each chromosomes segment for data from an unphased diploid.

My first issue is about the time intervals: I noticed that the output matrix seems to be just the genomic intervals (rows) by time intervals (columns). The guide mentions the first row is the header with time boundaries, but that doesn't appear to be the case looking at the output. Please correct me if I'm wrong. Assuming the time boundaries are not there, do you know where I should get them? Are they the same intervals constructed in the msmc (psmc') run if use the same exact -m and -r values in decode as those reported by msmc?

The second issue is that I just want to verify which time interval is the most recent -- it's the leftmost, correct?

Thanks, Jason

stschiff commented 8 years ago

Hi Jason,

for me it works. Here is my command line and output (first 10 lines):

build/decode -t 10 -m 0.001 -r 0.001 --nrThreads 8 input_file.txt | head
generating propagation core
generating Hidden Markov Model
#time_boundaries:   0   0.0175601   0.0371906   0.0594458   0.0851376   0.115525    0.152715    0.200662    0.26824 0.383764    
1845    0.923997    0.0449414   0.0126274   0.00682254  0.00432511  0.00290508  0.00197339  0.00130416  0.000787414 0.00031665  
2845    0.932081    0.0442336   0.011268    0.00546186  0.00307113  0.00180618  0.00105958  0.000597521 0.000307746 0.000112964 
3845    0.938086    0.0432216   0.00994658  0.00432126  0.00216526  0.00113318  0.000596643 0.000310684 0.000156939 6.20353e-05 
4845    0.942853    0.042056    0.00873138  0.0034061   0.00153475  0.000731541 0.000361938 0.000186368 9.79706e-05 4.09651e-05 
5845    0.946801    0.0408116   0.00764213  0.0026866   0.00110221  0.000491958 0.000238837 0.000126386 6.9338e-05  2.98745e-05 
6845    0.950168    0.0395306   0.00667954  0.00212696  0.000806742 0.00034714  0.000170744 9.38377e-05 5.30447e-05 2.32379e-05 
7845    0.953102    0.0382388   0.00583623  0.00169414  0.000604505 0.000257586 0.000130593 7.42143e-05 4.27462e-05 1.89217e-05 
8845    0.955702    0.0369525   0.00510158  0.0013603   0.000465196 0.000200519 0.000105294 6.135e-05   3.57562e-05 1.59407e-05 
9845    0.958034    0.0356826   0.00446404  0.001103    0.00036827  0.000162855 8.83304e-05 5.2368e-05  3.07607e-05 1.37851e-05

And as you can see, the leftmost state is the one for the most recent time interval. Stephan

jthuff commented 8 years ago

Hi Stephan, Thanks for clarifying the columns. The version I had leaves out the header row (?). I named my input to run with the same command. I don't recall any compilation errors, so maybe it was just an older version:

$ build/decode -t 10 -m 0.001 -r 0.001 --nrThreads 8 input_file.txt | head
generating propagation core
generating Hidden Markov Model
0.0839459 0.275551 0.224518 0.136045 0.0882558 0.0654098 0.0507695 0.0381999 0.0258882 0.0114167 
0.0937437 0.336102 0.275062 0.146785 0.0711956 0.0369691 0.0206386 0.0114949 0.00590201 0.0021062 
0.096862 0.357485 0.293336 0.148008 0.0607291 0.0240603 0.0104742 0.00520427 0.00273743 0.00110443 
0.085065 0.33156 0.284737 0.156447 0.0745276 0.0353737 0.0173716 0.00878245 0.00443559 0.00170066 
0.0411786 0.227055 0.227586 0.16078 0.114953 0.0865855 0.0630713 0.0418735 0.0250869 0.0118297 
0.020529 0.117677 0.132152 0.118055 0.116024 0.121507 0.123938 0.116359 0.0928101 0.0409486 
2.68557e-05 0.000656793 0.00418891 0.0143457 0.0354282 0.0709161 0.121868 0.188707 0.268028 0.295835 
0.00643857 0.0435094 0.0923564 0.137901 0.170269 0.181573 0.165589 0.122427 0.0644121 0.0155252 
0.000213611 0.00468051 0.0173778 0.0375937 0.0621115 0.0874507 0.112869 0.146009 0.210253 0.321441 
7.91957e-06 0.000159891 0.000751299 0.00231555 0.00611549 0.0155559 0.039697 0.101496 0.25711 0.576791

...because I cloned just now from github and recompiled build/decode, and it works perfectly! Thanks

$ build/decode -t 10 -m 0.001 -r 0.001 --nrThreads 8 input_file.txt | head
generating propagation core
generating Hidden Markov Model
#time_boundaries:   0   0.105361    0.223144    0.356675    0.510826    0.693147    0.916291    1.20397 1.60944 2.30259 
178 0.0839459   0.275551    0.224518    0.136045    0.0882558   0.0654098   0.0507695   0.0381999   0.0258882   0.0114167   
1178    0.0937437   0.336102    0.275062    0.146785    0.0711956   0.0369691   0.0206386   0.0114949   0.00590201  0.0021062   
2178    0.096862    0.357485    0.293336    0.148008    0.0607291   0.0240603   0.0104742   0.00520427  0.00273743  0.00110443  
3178    0.085065    0.33156 0.284737    0.156447    0.0745276   0.0353737   0.0173716   0.00878245  0.00443559  0.00170066  
4178    0.0411786   0.227055    0.227586    0.16078 0.114953    0.0865855   0.0630713   0.0418735   0.0250869   0.0118297   
5178    0.020529    0.117677    0.132152    0.118055    0.116024    0.121507    0.123938    0.116359    0.0928101   0.0409486   
6178    2.68557e-05 0.000656793 0.00418891  0.0143457   0.0354282   0.0709161   0.121868    0.188707    0.268028    0.295835    
7178    0.00643857  0.0435094   0.0923564   0.137901    0.170269    0.181573    0.165589    0.122427    0.0644121   0.0155252   
8178    0.000213611 0.00468051  0.0173778   0.0375937   0.0621115   0.0874507   0.112869    0.146009    0.210253    0.321441
jthuff commented 8 years ago

And one more quick question about the output. I see now there's also row numbers, which is great, because I'd like to find the precise intervals that overlap a set of genomic elements. Do these row numbers indicate that the chromosome intervals are as follows? 1-178 179-1178 1179-2178 2179-3178 3179-4178 4179-5178 5179-6178 6179-7178 7179-8178

Thanks again

stschiff commented 8 years ago

Strictly, these are not actually windows across the chromosome, but estimates of the tMRCA at a particular position, indicated by the first column and spread every 1000 bp starting with the first segregating site. In practice what you are suggesting above would be OK, however, because the tMRCA is not expected to change radically from one position to the next. I would probably take one estimate to be the same for the entire window coming before or after it, as you suggest.

jthuff commented 8 years ago

OK, I understand better now I think. I'll try a couple different ways to overlap "windows" for getting the subset of tMRCA values to make sure that the answer is robust to slight differences in the procedure. As you point out, the estimated tMRCA shouldn't change radically over short chromosome distances in most cases, so a variety of procedures will hopefully give similar results.

Best, Jason