tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
179 stars 86 forks source link

SMC_prime and record_full_arg option #709

Closed ismael-mendoza closed 5 years ago

ismael-mendoza commented 5 years ago

Hi,

I am trying to understand how the simulation of the smc_prime model and the record_full_arg go together in the output produced. For instance, when I run a command like:

ts = msprime.simulate(sample_size=3, Ne=1,
                                 length=1, recombination_rate=0.6, model='smc_prime', random_seed = 13, record_full_arg=True)

I get an output table and trees :

id  flags   population  individual  time    metadata
0   1   0   -1  0.00000000000000    
1   1   0   -1  0.00000000000000    
2   1   0   -1  0.00000000000000    
3   0   0   -1  0.62320971329399    
4   131072  0   -1  2.07225561292808    
5   131072  0   -1  2.07225561292808    
6   131072  0   -1  2.23677155238815    
7   131072  0   -1  2.23677155238815    
8   262144  0   -1  2.70109242100293    
9   0   0   -1  3.75441425511497    
10  0   0   -1  4.58649750488411    
image image image

(The trees were printed in this order in my output)

If I am understanding this correctly, the first recombination event (the first and second tree above) involved nodes 6 and 7 where the tree did not change because it was an invisible transition. I am guessing that the second recombination event involved nodes 4,5 where the pruned branch at 4 was regrafted somewhere above node 9 and ended up in node 10. (or maybe these recombination nodes do not indicate the pruning site at all? )

Another question has to do with node 8 which has the flag of a NODE_IS_CA_EVENT. As I understand it, the SMC prime process does not involve the ARG and so I am confused why there would be a CA event that does not produce a coalescence in the tree for a SMC prime process?

I was also wondering if one was to remove all of the NODE_IS_CA_EVENT and NODE_IS_RE_EVENT nodes from each of the trees obtained from the output of simulate with record_full_arg=True and model=smc_prime options, would what is left be the same trees as running a standard SMC prime process or would there be slightly different given that the ARG is involved in their production in some way?

Sorry for the long post, and thanks in advance for your help in understanding this functionality better!

jeromekelleher commented 5 years ago

If I am understanding this correctly, the first recombination event (the first and second tree above) involved nodes 6 and 7 where the tree did not change because it was an invisible transition. I am guessing that the second recombination event involved nodes 4,5 where the pruned branch at 4 was regrafted somewhere above node 9 and ended up in node 10. (or maybe these recombination nodes do not indicate the pruning site at all? )

Hmmm. I'm not sure what's going on here; @JereKoskela understands this much better than I do: what do you think happened here Jere?

Another question has to do with node 8 which has the flag of a NODE_IS_CA_EVENT. As I understand it, the SMC prime process does not involve the ARG and so I am confused why there would be a CA event that does not produce a coalescence in the tree for a SMC prime process?

I don't understand this either; we shouldn't have these CA events in the SMC. This sounds like a bug to me.

I was also wondering if one was to remove all of the NODE_IS_CA_EVENT and NODE_IS_RE_EVENT nodes from each of the trees obtained from the output of simulate with record_full_arg=True and model=smc_prime options, would what is left be the same trees as running a standard SMC prime process or would there be slightly different given that the ARG is involved in their production in some way?

If you run simplify() on the returned tree sequence from record_full_arg=True, you will get a tree sequence with identical statistical properties are you would have with record_full_arg=False. The catch is that the tree sequences themselves won't be identical because the simulation will be put on a slightly different random trajectory by storing the full ARG. It's much like what happens when you run two different random seeds: the tree sequences are both samples from the distribution, but they differ in details. Does this make sense?

JereKoskela commented 5 years ago

If I am understanding this correctly, the first recombination event (the first and second tree above) involved nodes 6 and 7 where the tree did not change because it was an invisible transition. I am guessing that the second recombination event involved nodes 4,5 where the pruned branch at 4 was regrafted somewhere above node 9 and ended up in node 10. (or maybe these recombination nodes do not indicate the pruning site at all? )

If we measure time from the leaves towards the root, the first recombination involves branch 1 splitting into branches 4 and 5. The two ancestors involved with the recombination event are marked with the NODE_IS_RE_EVENT flag, and the child is recorded in the edge table. Node 4 then recombines again to form nodes 6 and 7, which merge back into node 8 before anything else happens to either of them. Thus, the recombination event from 1 to 4 and 5 has an effect on the tree, while the recombination from 4 to 6 and 7 does not.

Another question has to do with node 8 which has the flag of a NODE_IS_CA_EVENT. As I understand it, the SMC prime process does not involve the ARG and so I am confused why there would be a CA event that does not produce a coalescence in the tree for a SMC prime process?

This is a merger event of ancestral segments that touch, but don't overlap, i.e. [a, b) and [b, c). That's allowed in the SMC, unless I'm mistaken? It would never show up in default msprime output because there is no coalescence of ancestral material, but the common ancestor event is still between two adjacent segments.

I was also wondering if one was to remove all of the NODE_IS_CA_EVENT and NODE_IS_RE_EVENT nodes from each of the trees obtained from the output of simulate with record_full_arg=True and model=smc_prime options, would what is left be the same trees as running a standard SMC prime process or would there be slightly different given that the ARG is involved in their production in some way?

The realisations will be different, but statistically both ways of simulating output are identical. Of course, you have to be careful to e.g. update the parents of edges to point to ancestors of deleted nodes, rather than to nodes which no longer exist. As I understand it, this is exactly what the simplify() function does.

ismael-mendoza commented 5 years ago

Thank you @jeromekelleher and @JereKoskela, your answers have been very helpful in me understanding both the ARG and your algorithm. I have a couple of follow up questions which are a bit more specific:

This is a merger event of ancestral segments that touch, but don't overlap, i.e. [a, b) and [b, c). That's allowed in the SMC, unless I'm mistaken? It would never show up in default msprime output because there is no coalescence of ancestral material, but the common ancestor event is still between two adjacent segments.

The realisations will be different, but statistically both ways of simulating output are identical. Of course, you have to be careful to e.g. update the parents of edges to point to ancestors of deleted nodes, rather than to nodes which no longer exist. As I understand it, this is exactly what the simplify() function does.

Thanks again for your help answering my questions!

jeromekelleher commented 5 years ago

Is this merger event the only common ancestor event allowed in a simulation of the SMC prime process that does not result in a coalescence in the marginal trees?

Yes, that's true. If you used the "smc" model, then you could never have any CA arg events. See the code for details of how the models are implemented

Does this type of CA event also indicate that a hidden transition occurred in the SMC prime process? (two identical adjacent trees have exactly the same coalescent times?)

I don't think that's necessarily true. You'll have a unary node in the relevant trees, but not any extra trees (besides those produced by the recombination event). (I think, I'm not 100% confident on that one.)

I would like to do something like what simplify() does but without squashing together identical adjacent trees, is this currently possible in msprime?

I don't think there's any way to prevent simplify from squashing identical adjacent trees. What nodes do you want to remove, and why do you want to remove them?

ismael-mendoza commented 5 years ago

Thanks @jeromekelleher,

Does this type of CA event also indicate that a hidden transition occurred in the SMC prime process? (two identical adjacent trees have exactly the same coalescent times?)

I don't think that's necessarily true. You'll have a unary node in the relevant trees, but not any extra trees (besides those produced by the recombination event). (I think, I'm not 100% confident on that one.)

Yeah sorry I should have been more specific. For any given run of 'SMC prime', I am observing that the total number of nodes (in the node table) with the NODE_IS_CA_EVENT flag is always equal to the total number of pairs of identical adjacent trees (have all the same coalescent times) for that run. So this made me think that there is some relationship between trees with a NODE_IS_CA_EVENT node and whether a hidden transition occurred.

I don't think there's any way to prevent simplify from squashing identical adjacent trees. What nodes do you want to remove, and why do you want to remove them?

I guess I am trying to remove all the NODE_IS_CA_EVENT and NODE_IS_RE_EVENT nodes without removing the latent trees / hidden transitions for the 'SMC prime' process. The reason I want to remove them is to obtain the Newick format of these trees without the extra nodes. But now that I think about it, I could use simplify to obtain the Newick format of the trees and record where the hidden transitions occur from the full arg output which would be all I need.

JereKoskela commented 5 years ago

Yeah sorry I should have been more specific. For any given run of 'SMC prime', I am observing that the total number of nodes (in the node table) with the NODE_IS_CA_EVENT flag is always equal to the total number of pairs of identical adjacent trees (have all the same coalescent times) for that run. So this made me think that there is some relationship between trees with a NODE_IS_CA_EVENT node and whether a hidden transition occurred.

I think that's right when the locus is continuous and the recombination map is given by a density, which is the default for msprime.simulate. In that setting, the only way that a NODE_IS_CA_EVENT node can arise is if a recombination is undone by a merger before either ancestor merges with anyone else. If the recombination map is discrete, then exactly the same point can recombine many times, and you can get NODE_IS_CA_EVENT nodes even when the adjacent trees are not equal.

jeromekelleher commented 5 years ago

I guess I am trying to remove all the NODE_IS_CA_EVENT and NODE_IS_RE_EVENT nodes without removing the latent trees / hidden transitions for the 'SMC prime' process. The reason I want to remove them is to obtain the Newick format of these trees without the extra nodes. But now that I think about it, I could use simplify to obtain the Newick format of the trees and record where the hidden transitions occur from the full arg output which would be all I need.

I think this is the simplest thing to do all right. OK, looks like this is resolved so going to close the issue.