tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Add ``keep_input_roots`` option to simplify #775

Closed jeromekelleher closed 4 years ago

jeromekelleher commented 4 years ago

Following from the discussion in #716, it seems like a lot of the confusion surrounding missing data is due to simplify not having quite the right semantics for the forward simulation case. For forward simulations, we really do want to know about the branches back to the original generation, but these are removed by simplify as it gets rid of any topology above the roots. So, I think we want to have an option to simplify called keep_input_roots which preserves the roots of the trees in the input tree sequence. The easy way to do this is something like

        all_roots = set()
        for tree in ts.trees():
            for root in tree.roots:
                all_roots.add(root)        
        ts.simplify(samples + list(all_roots))

This doesn't do quite what we'd want though, I think. Consider this example

1.76┊    14       ┊    14       ┊     14      ┊    14       ┊             ┊  
    ┊  ┏━━┻━┓     ┊  ┏━━┻━━━┓   ┊   ┏━━┻━━┓   ┊  ┏━━┻━━┓    ┊             ┊  
1.51┊  ┃    ┃     ┊  ┃     13   ┊   ┃    13   ┊  ┃    13    ┊      13     ┊  
    ┊  ┃    ┃     ┊  ┃    ┏━┻━┓ ┊   ┃    ┏┻━┓ ┊  ┃   ┏━┻━┓  ┊    ┏━━┻━━┓  ┊  
1.27┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  12   ┃  ┊    ┃    12  ┊  
    ┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  ┏┻┓  ┃  ┊    ┃    ┏┻┓ ┊  
1.21┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊   11    ┃ ┃ ┊  
    ┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┏━┻━┓  ┃ ┃ ┊  
0.64┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊  10    ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
    ┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊  ┏┻━┓  ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
0.49┊  ┃    9     ┊  ┃    9   ┃ ┊  ┃  ┃  ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
    ┊  ┃  ┏━┻━┓   ┊  ┃  ┏━┻┓  ┃ ┊  ┃  ┃  ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
0.31┊  ┃  ┃   8   ┊  ┃  ┃  8  ┃ ┊  ┃  ┃  8  ┃ ┊  ┃  ┃ ┃  8  ┊  ┃   8  ┃ ┃ ┊  
    ┊  ┃  ┃ ┏━┻┓  ┊  ┃  ┃ ┏┻┓ ┃ ┊  ┃  ┃ ┏┻┓ ┃ ┊  ┃  ┃ ┃ ┏┻┓ ┊  ┃  ┏┻┓ ┃ ┃ ┊  
0.26┊  ┃  ┃ ┃  7  ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  
    ┊  ┃  ┃ ┃ ┏┻┓ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  
0.03┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  
    ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊  
0.00┊ 0 5 1 2 3 4 ┊ 0 5 1 2 4 3 ┊ 0 5 1 2 4 3 ┊ 0 5 1 3 2 4 ┊ 0 5 2 4 1 3 ┊  
  0.00          0.18          0.37          0.75          0.91          1.00 

when we simplify down for samples = [0, 5] (plus the roots) we get

1.76┊  3    ┊   3   ┊     3 ┊  
    ┊  ┃    ┊  ┏┻━┓ ┊       ┊  
1.51┊  ┃  2 ┊  ┃  2 ┊  2    ┊  
    ┊  ┃    ┊  ┃    ┊  ┃    ┊  
0.03┊  4    ┊  4    ┊  4    ┊  
    ┊ ┏┻┓   ┊ ┏┻┓   ┊ ┏┻┓   ┊  
0.00┊ 0 1   ┊ 0 1   ┊ 0 1   ┊  
  0.00    0.18    0.91    1.00 

(If this looks weird in your browser, try a different one. Curse the modern web and lack of reliable monospace fonts!)

The weirdness in how the trees are drawn here is because the old roots are samples, but if we marked them as non-samples the odd topologies would still be there. What do we do with nodes that are roots in one part of the tree sequence but are internal nodes elsewhere? We probably don't want to leave them out of the edges, as this would be misleading, so this means we kind-of have to mark the input roots as special beforehand. If we just want to mark the nodes at the top of the trees, then in principle this information is available in simplify at the end of the algorithm, but it's not obvious to me how to make use of it.

Not sure what to do here. Perhaps it would be best if the forward simulators made a pass through the trees first to get the roots, like the example above? In practise, this would just be a list of the first generation individuals who have any descendants. The cost would be to (a) build the index required for iterating over the trees, which is significant; and (b) a single pass through the trees and roots, which is probably not that big a cost, unless it's a large simulation and done very early (and so there's lots of roots).

Any thoughts @molpopgen @petrelharp @bhaller?

petrelharp commented 4 years ago

Hm. Well, I think the general principle here is that simplify keeps around only minimally enough information to reconstruct the trees spanned by the samples. If there are multiple roots then keeping those roots is important because when they are tells us something about the (unobserved) rest of the tree. In other words, to recapitate correctly we'd need to know when they were.

I guess you could make the same argument about keeping the root at a coalesced tree, since even if this tree is coalesced, maybe another one hasn't. But, this is less clear to me. I was imagining only keeping the roots when there's more than one of them; I see you're imagining keeping the roots of all the trees. I think that's fine, at least. Does it turn out to be easier to do one or the other thing?

What do we do with nodes that are roots in one part of the tree sequence but are internal nodes elsewhere? We probably don't want to leave them out of the edges, as this would be misleading

Why would this be misleading? This seems similar to the case where, say, 0 inherits from 2 on [0,2] and 1 inherits from 2 on [0,1], and in the simplified tree sequence we don't know any more that 0 inherits from 2 on [1,2], since there wasn't a coalescence there, so you don't need to know it to reconstruct the trees spanned by the samples. So, in your example above, in the simplified tree sequence I think we only need edges attached to node 2 in the most right-hand segment, even if 2 had inherited from 3 on the left-most segment.

If we just want to mark the nodes at the top of the trees, then in principle this information is available in simplify at the end of the algorithm, but it's not obvious to me how to make use of it.

I think that's all we want to do. What do you mean "make use of it"? I'm probably missing something.

bhaller commented 4 years ago

Well, I don't really understand all the issues surrounding simplification; I've never delved deeply into that (although I probably should, one of these days). But I guess my first reaction to this is: if it ain't broke, don't fix it. As I understand it, the changes to simplify proposed here would not remove the need for the flag that says "don't consider isolated samples to be missing"; that flag will still be needed/used in SLiM and other forward simulators. And once that flag is used, simplify seems to be doing what we want it to do; I don't (think?) I have a problem with the way SLiM presently keeps first-gen ancestors around for recapitation, etc., and it's pretty well-established in a lot of practice (examples in both the SLiM and pyslim manuals, and probably a fair bit of end-user code now). The changes being proposed here sound like they would involve the additional of new pre-processing steps that would potentially be slow in some cases; and they sound potentially destabilizing, with knock-on consequences we might not be foreseeing now, and since the result of simplify would change it would potentially break backward compatibility for end users. And I'm not seeing what the counterbalancing win is, to justify that. Sorry if I'm just being dense, but can somebody explain what the real, concrete need for this change is?

molpopgen commented 4 years ago

(I'm not paying close attention these days b/c the quarter starts in 6 weeks...)

I just want to point out that anything involving a tree traversal first will become very inefficient for simulations with lots of ancient samples. I (and I think Peter, too) have done a lot of "remember everyone for a while" simulations, and the tree traversal becomes a bottleneck in those cases.

molpopgen commented 4 years ago

I have a problem with the way SLiM presently keeps first-gen ancestors around for recapitation, etc., and it's pretty well-established in a lot of practice (examples in both the SLiM and pyslim manuals, and probably a fair bit of end-user code now)

fwdpy11 supports both this and "pre-capitation" (starting from a tskit tree sequence). The fwdpy11.infinite_sites function also avoids the missing data problem by first doing a tree traversal, identifying trees w/multiple roots, and then assigning mutations above those root nodes as needed.

bhaller commented 4 years ago

@molpopgen, note that what I wrote was "I don't (think?) I have a problem with the way...", so your quoting above reverses my intended meaning.

molpopgen commented 4 years ago

so your quoting above reverses my intended meaning.

yep, the r command sometimes fails in GitHub.

jeromekelleher commented 4 years ago

Why would this be misleading? This seems similar to the case where, say, 0 inherits from 2 on [0,2] and 1 inherits from 2 on [0,1], and in the simplified tree sequence we don't know any more that 0 inherits from 2 on [1,2], since there wasn't a coalescence there, so you don't need to know it to reconstruct the trees spanned by the samples. So, in your example above, in the simplified tree sequence I think we only need edges attached to node 2 in the most right-hand segment, even if 2 had inherited from 3 on the left-most segment.

Yes, this is fair @petrelharp - I think you're right, although there's some tricky subtleties probably when it comes to unary nodes.

If we just want to mark the nodes at the top of the trees, then in principle this information is available in simplify at the end of the algorithm, but it's not obvious to me how to make use of it.

I think that's all we want to do. What do you mean "make use of it"? I'm probably missing something.

Well, the information is there in that I can "see" the edges that we want to add quite clearly in the ancestry array, but I don't know how to distinguish root nodes from non-roots. The idea is that the ancestry has propagated upwards through the tree sequence, and the stuff that's left over at the roots will be at the "top" of the ancestry array. I've thought about it a bit though, and it's not obvious to me how to define the "top" of the ancestry, even in the case of single-rooted trees. There must be a way though...

jeromekelleher commented 4 years ago

@bhaller - the proposal is to add an option to simplify that does a cheap final step in which we add edges to the roots of the output trees, joining them to the oldest nodes in the input trees that were ancestral to those roots. It's not an expensive or intrusive thing (well, if we can figure out how to make it work!). It's really to address the fundamentally unsatisfactory outcome of having nodes that look like missing data after you've run a simulation for a long time - this just shouldn't be the case. So, by using this option, nodes output by a forward simulator will always have a path back to the node in the first generation, which seems generally like something you want (indeed, that's what you do in SLiM by various devices, as I understand it). A tree sequence output from a forward simulation would likewise never have any missing data in it, assuming it's been run for at least one generation. I guess you could have some weird things like the original individuals persist in the population for a while in an overlapping generations model, and their genotypes will decode as missing data if they have no mutations. That seems defensible to me, though.

petrelharp commented 4 years ago

Re: SLiM - adding this ability to simplify would actually remove the requirement that SLiM need to keep around the first generation, which is a significant pain and a major source of confusion to users. It could also make recapitation go substantially faster, since if someone wants 100 samples from a population of 100,000, they currently need to recapitate with 100,000 and then simplify; but with this ability, they could simplify to 100 first, and then recapitate. If the simulation time period in SLiM is short, this will go a lot faster.

jeromekelleher commented 4 years ago

I was thinking it would simplify this stuff quite a bit all right, thanks @petrelharp - any thoughts on feasability?

petrelharp commented 4 years ago

I don't know how to distinguish root nodes from non-roots.

Oh right - is this because in the simplify algorithm, nodes acquire ancestry but never give it away?

jeromekelleher commented 4 years ago

Oh right - is this because in the simplify algorithm, nodes acquire ancestry but never give it away?

Exactly!

petrelharp commented 4 years ago

I'm having a look - the python version is up to date, right?

bhaller commented 4 years ago

OK; those do sound like somewhat compelling arguments. One concern I still have is the breaking of backward compatibility; results from SLiM will be significantly different as I understand it (not all of the first-gen ancestors will be present, for one thing), and it's quite possible that that change would break existing scripts, right? @petrelharp, do you have a sense of what the likely impact would be for existing users? Do people use those first-gen ancestors for anything besides recapitation?

And the other concern is performance; yes, I can see that recapitation could go faster, which is great, but it also sounds like simplification during forward simulation could go substantially slower because of the additional bookkeeping, and it sounds like that's a concern for @molpopgen too ("anything involving a tree traversal first will become very inefficient for simulations with lots of ancient samples").

What timeframe are we talking about doing this in? Are you guys thinking 0.3.0, or has that ship sailed?

petrelharp commented 4 years ago

I think the consequences for SLiM's internal workings are a separate discussion... but: users can still remember the first generation explicitly. Perhaps we'd deprecate the whole FIRST_GEN thing at the next big release. And, no - as far as I can tell, those first gen samples just confuse people.

jeromekelleher commented 4 years ago

And the other concern is performance; yes, I can see that recapitation could go faster, which is great, but it also sounds like simplification during forward simulation could go substantially slower because of the additional bookkeeping, and it sounds like that's a concern for @molpopgen too ("anything involving a tree traversal first will become very inefficient for simulations with lots of ancient samples").

No, the point is that we wouldn't need to do any tree traversals @bhaller - I just used them above as a simple way of illustrating what the idea was. If it works, it would be a cheap addition to simplify, with essentially zero extra overhead. If it works, mind - we don't know how to do it yet!

WRT to timeframes, well it might be nice to make this change for the current version of SLiM you're working on since there's a bunch of other tskit changes going in as well. Well have to push out another release of the tskit C API anyway to get the bugfixes for missing data in (#778), and as well as that there's much less need to worry about missing data for SLiM users if we figure out how to do it.

bhaller commented 4 years ago

I think the consequences for SLiM's internal workings are a separate discussion... but: users can still remember the first generation explicitly. Perhaps we'd deprecate the whole FIRST_GEN thing at the next big release. And, no - as far as I can tell, those first gen samples just confuse people.

Well, this is not just about SLiM's "inner workings". If the point of this is to fix the way that simplification works for forward simulators, but it has unacceptable side effects for the users of those forward simulators, then that would render the whole feature rather pointless, no? Yes, for users of SLiM there would be workarounds; they could remember the first generation, and modify their Python scripts accordingly. But that is breakage of existing behavior, forcing people to change how they are doing things rather than preserving backward compatibility, and as we were just discussing on the tskit list, that's generally to be avoided – some were advocating that it is always to be avoided, IIRC, although I don't subscribe to that view myself. It sounds to me like the benefits may outweigh the costs, but I don't think that cost-benefit analysis is a separate discussion; it is central to whether this should be done at all, in my opinion. So I want to think the consequences through and have that discussion.

That said, it sounds like it may well be a worthwhile change. I'd certainly like to be able to get rid of all that first-gen cruft and the confusion surrounding it!

Another area where backward compatibility would perhaps be broken would be in exactly what mutation overlay after a SLiM simulation does, right? Right now in the SLiM manual we have a discussion of mutation overlay that draws a distinction: you can overlay mutations just back to the point of coalescence by simplifying first (stripping away everything above coalescence) and then overlaying, or you can overlay mutations all the way back to the first generation of forward simulation by not simplifying first. Would this story change? (Perhaps most saliently: would the new simplify behavior be the default, or would the old behavior be the default, in Python? If the new behavior is the default, then as I understand it existing scripts to overlay just to the point of coalescence would no longer work correctly, because the ancestors above coalescence would no longer be stripped away; is that correct? If the old behavior would be the default in Python, that allays some of my concerns.)

No, the point is that we wouldn't need to do any tree traversals @bhaller - I just used them above as a simple way of illustrating what the idea was. If it works, it would be a cheap addition to simplify, with essentially zero extra overhead. If it works, mind - we don't know how to do it yet!

Ah, OK, great. That is very helpful.

WRT to timeframes, well it might be nice to make this change for the current version of SLiM you're working on since there's a bunch of other tskit changes going in as well. Well have to push out another release of the tskit C API anyway to get the bugfixes for missing data in (#778), and as well as that there's much less need to worry about missing data for SLiM users if we figure out how to do it.

I agree; if we do this, it would be great to roll it in with other tskit changes to make more of a simple, clean break with the past. I am happy to hold SLiM 3.5 for this, if we do decide to go forward with it.

petrelharp commented 4 years ago

Re: feasibility - I don't see an easy way to do this. Maybe I'm missing something, but I think we'd have to either also have nodes give away ancestry as the algorithm goes along, or do a post-processing step that would be equivalent to iterating over the trees. Both of those things would be substantial work and make it go slower. I propose marking this as an enhancement that we'd like to do in the future.

jeromekelleher commented 4 years ago

:disappointed: I'll have one more look, but I agree if we don't see a way forward then we should walk away from it for now. It'd be a pity, seemed like such an elegant solution to quite a few problems.

gtsambos commented 4 years ago

Hmm, so there may be some fancy way to do this by deleting segments, but I feel like this should be possible anyway with relatively minor changes to simplify -- you just need a record of which nodes have actually come from this initial generation of the simulation (via a node flag perhaps, or from other SLiM output)? The relevant part of the algorithm to change is here:

        for left, right, X in overlapping_segments(S):
            if len(X) == 1:
                ancestry_node = X[0].node
                if is_sample:
                    self.record_edge(left, right, output_id, ancestry_node)
                    ancestry_node = output_id
                elif self.keep_unary:
                    if output_id == -1:
                        output_id = self.record_node(input_id)
                    self.record_edge(left, right, output_id, ancestry_node)

This is currently where we decide what to do when we get up to an ancestral node in the sequence with a list of overlapping descendant segments (X) that is of length 1 (ie. it corresponds to a unary node in the tree sequence). Currently, we only record an edge in this situation if we have explicitly specified that we want to keep unary nodes (which would also solve this problem btw! But leave lots of other nodes behind as well), or if the current ancestral node has been designated as a sample of interest by the user. It shouldn't be too hard to add another if statement to list to cover the case where the current ancestral node is part of this founding generation.

gtsambos commented 4 years ago

I'm happy to give this a go, btw -- it's a fairly small change. As I said, I'm not sure what would need to happen on the SLiM/fwdpy end to ensure that a record of the founding nodes is kept, but something like a node flag seems like it would be helpful here.

benjeffery commented 4 years ago

Am I right in thinking that the changes to simplify wouldn't be needed if such nodes had a flag that marked them as they could then be considered "isolated but not missing". If you have to add a flag to enable this change to simplify, then why not skip the change to simplify and just add a flag? Or am I missing something?

gtsambos commented 4 years ago

Am I right in thinking that the changes to simplify wouldn't be needed if such nodes had a flag that marked them as they could then be considered "isolated but not missing".

I don't think it would always be trivial to determine which nodes are "isolated but not missing" without actually running through simplify first, but the bigger issue is that this is only a property of a node over a particular interval, not across the tree sequence as a whole. There might be a founding node that is isolated on some interval, but that has multiple children on another.

benjeffery commented 4 years ago

Sorry I should have said "if isolated don't consider missing".

jeromekelleher commented 4 years ago

Am I right in thinking that the changes to simplify wouldn't be needed if such nodes had a flag that marked them as they could then be considered "isolated but not missing". If you have to add a flag to enable this change to simplify, then why not skip the change to simplify and just add a flag? Or am I missing something?

Sorry I should have said "if isolated don't consider missing".

I think the handling of missing data is a side issue here @benjeffery. What's at issue here is a paradoxical result of running simplify on a forward-simulated tree sequence: at a particular position on the genome, nodes can change from being non-missing to missing. We're happy with the definition of missing data, and the discussion of that really only came about because of this weird behaviour from simplify in this particular application. So, what we're trying to do is to add an option to the simplify algorithm so that it better supports the forward simulator use-case and it doesn't produce this paradoxical result. Then, we don't need to worry so much about being able to turn off missing data handling because forward sims will almost never produce it (and it's probably a good warning flag for users anyway, if it does).

gtsambos commented 4 years ago

Hmm, so there may be some fancy way to do this by deleting segments, but I feel like this should be possible anyway with relatively minor changes to simplify

fyi, I made a pull request (#781) that does this in the Python implementation of simplify, and I added Jerome's example above as a simple test case. It seems to produce the correct output, but I haven't tested it very much.

jeromekelleher commented 4 years ago

I've also made a quick prototype in Python! See #782. This is to show what we would get when we snip out the ancestry as we go through simplify. Here's the result I get for an example:

1.76┊    14       ┊    14       ┊     14      ┊    14       ┊             ┊  
    ┊  ┏━━┻━┓     ┊  ┏━━┻━━━┓   ┊   ┏━━┻━━┓   ┊  ┏━━┻━━┓    ┊             ┊  
1.51┊  ┃    ┃     ┊  ┃     13   ┊   ┃    13   ┊  ┃    13    ┊      13     ┊  
    ┊  ┃    ┃     ┊  ┃    ┏━┻━┓ ┊   ┃    ┏┻━┓ ┊  ┃   ┏━┻━┓  ┊    ┏━━┻━━┓  ┊  
1.27┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  12   ┃  ┊    ┃    12  ┊  
    ┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  ┏┻┓  ┃  ┊    ┃    ┏┻┓ ┊  
1.21┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊   11    ┃ ┃ ┊  
    ┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊   ┃    ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┏━┻━┓  ┃ ┃ ┊  
0.64┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊  10    ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
    ┊  ┃    ┃     ┊  ┃    ┃   ┃ ┊  ┏┻━┓  ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
0.49┊  ┃    9     ┊  ┃    9   ┃ ┊  ┃  ┃  ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
    ┊  ┃  ┏━┻━┓   ┊  ┃  ┏━┻┓  ┃ ┊  ┃  ┃  ┃  ┃ ┊  ┃  ┃ ┃  ┃  ┊  ┃   ┃  ┃ ┃ ┊  
0.31┊  ┃  ┃   8   ┊  ┃  ┃  8  ┃ ┊  ┃  ┃  8  ┃ ┊  ┃  ┃ ┃  8  ┊  ┃   8  ┃ ┃ ┊  
    ┊  ┃  ┃ ┏━┻┓  ┊  ┃  ┃ ┏┻┓ ┃ ┊  ┃  ┃ ┏┻┓ ┃ ┊  ┃  ┃ ┃ ┏┻┓ ┊  ┃  ┏┻┓ ┃ ┃ ┊  
0.26┊  ┃  ┃ ┃  7  ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  
    ┊  ┃  ┃ ┃ ┏┻┓ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  ┃  ┃ ┃ ┃ ┃ ┊  
0.03┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  6  ┃ ┃ ┃ ┃ ┊  
    ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┃ ┊  
0.00┊ 0 5 1 2 3 4 ┊ 0 5 1 2 4 3 ┊ 0 5 1 2 4 3 ┊ 0 5 1 3 2 4 ┊ 0 5 2 4 1 3 ┊  
  0.00          0.18          0.37          0.75          0.91          1.00 

1.76┊  4  ┊     ┊  
    ┊  ┃  ┊     ┊  
1.51┊  ┃  ┊  3  ┊  
    ┊  ┃  ┊  ┃  ┊  
0.03┊  2  ┊  2  ┊  
    ┊ ┏┻┓ ┊ ┏┻┓ ┊  
0.00┊ 0 1 ┊ 0 1 ┊  
  0.00  0.91  1.00 

We simplify down to [0, 5], and nicely get edges back to the old roots. The current implementation in #782 is a prototype only, just to see if snipping out ancestry gives us an answer that we like. The right way to do things is to dynamically snip out the ancestry as it's used up during the algorithm.

Now, the question is, would snipping out ancestry dynamically slow things down? It might, but I'm no so sure. It might even speed things up, as we're getting rid of parts of the linked lists in self.ancestry, so we don't have to do so much iteration when we're going through the edges. On the other hand, it might slow things down, as these linked lists become more fragmented.

I think it's worth doing this because the data structures we have in memory during simplify will become more useful if we know exactly the amount of extant ancestry at all time points. Particularly for things like IBD that @gtsambos is interested in.

@petrelharp, @gtsambos, do you see what I'm getting at here? The prototype is quite a backwards way of doing things, but it was the quickest way I could think of to get something working.

jeromekelleher commented 4 years ago

Thanks for #781 @gtsambos, this is very interesting! I guess the downside there is that you do need to know who the founders are, whereas in what I'm proposing it happens naturally.

gtsambos commented 4 years ago

Hmmm, okay -- I think I'll need to go through your method tomorrow morning, as my brain is a bit fuzzy for this kind of thing so late! I'll have a closer look tomorrow, because I agree that having some idea of 'extant' ancestry could be useful for other reasons -- I'm having a bit of difficulty rn thinking about what this means when you have more than 2 samples you are interested in.

I guess the downside there is that you do need to know who the founders are, whereas in what I'm proposing it happens naturally.

Also, I don't think having to explicitly specify who the founders are is a big deal at the end of the day -- it could actually be an advantage because you could actually mean something different by this in different applications. In your illustration for example, whether or not to count node 13 as a founder is sort of a judgement call. On one hand, it isn't part of the initial generation. On the other hand, it is a root of the tree sequence.

FYI, our two PRs are giving different answers -- mine outputs this I'm pretty sure.

     4      |     4       |       4
┏┻┓  |    ┃       |      ┃
┃    ┃  |    ┃       |      3
┃    ┃  |     2       | ┏┻┓
┃    ┃  | ┏┻┓  | ┃    ┃
┃    ┃  | ┃    ┃  | ┃    ┃
0      1   |  0      1  |  0      1
gtsambos commented 4 years ago

yikes, I suck at tree sequence ascii art 😅 here is the edge table output from mine, for comparison. It is a bit different

############################################################
#   Edges                                                  #
############################################################
id  left        right       parent  child   metadata
0   0.37000000  0.75000000  2   0   
1   0.37000000  0.75000000  2   1   
2   0.91000000  1.00000000  3   0   
3   0.91000000  1.00000000  3   1   
4   0.00000000  0.37000000  4   0   
5   0.75000000  0.91000000  4   0   
6   0.00000000  0.37000000  4   1   
7   0.75000000  0.91000000  4   1   
8   0.37000000  0.75000000  4   2
gtsambos commented 4 years ago

Also, I can see now that there are a few different things we are talking about here in this thread all at the same time. One is to do with improving simplify for SLiM usage:

For forward simulations, we really do want to know about the branches back to the original generation, but these are removed by simplify as it gets rid of any topology above the roots. So, I think we want to have an option to simplify called keep_input_roots which preserves the roots of the trees in the input tree sequence.

Another is to do with recording useful information about extant lineages:

I think it's worth doing this because the data structures we have in memory during simplify will become more useful if we know exactly the amount of extant ancestry at all time points. Particularly for things like IBD that @gtsambos is interested in.

Both of these are very important and interesting, but also have different ends in mind. It is of course very cool if A and B can be hit with the same stone, but if there's a straightforward way of doing A and another straightforward (but different) way of doing B, that might be better than a complicated way of doing them together

petrelharp commented 4 years ago

Gee, that's a good idea, @gtsambos - restating, one option is to iterate over the initial trees to get a list of the roots, and provide these as a list of "keep these if they have any ancestry" nodes; however, in SLiM at least we know which nodes these are already just based on their time. I think that removing ancestry (#782) is the way to go if it can be made to work reasonably and doesn't slow things down, but this is a good other option.

bhaller commented 4 years ago

Gee, that's a good idea, @gtsambos - restating, one option is to iterate over the initial trees to get a list of the roots, and provide these as a list of "keep these if they have any ancestry" nodes; however, in SLiM at least we know which nodes these are already just based on their time. I think that removing ancestry (#782) is the way to go if it can be made to work reasonably and doesn't slow things down, but this is a good other option.

Jumping in here with one comment. I don't think it is true in general that we know which nodes are first-generation in SLiM based just on their time, because it is possible to create a new, empty (ancestryless) subpopulation in any generation, not just in the first generation.

petrelharp commented 4 years ago

it is possible to create a new, empty (ancestryless) subpopulation in any generation,

True - based on the flag, then.

jeromekelleher commented 4 years ago

closed in #782