Closed jeromekelleher closed 4 years ago
As I wrote on Slack: For what it’s worth, a quick look through SLiM’s code indicates that adapting to this change would have essentially no impact. We do not use that column, we just populate it and copy it in a couple places. Maybe I missed a spot or two, but anyway, looks trivial for us. @petrelharp
Hm, let's see, thinking this through out loud... Currently, we store the most recent node that has inherited the mutation, so that anyone inheriting from that node also inherits that mutation (unless other mutations intervene). That does seem like a good choice. We also know where the mutation is, thanks to the site. Changing it to the edge would add in the information of the most recent node that had the haplotype that the mutation occurred on. This is strictly more information, since the edge contains the (child) node. Changing this to edge would make associating mutations with edges an O(1) lookup, which would be nice. So, it sounds like a good idea, although I can't think of any applications other than error checking, as you say, once we have the times. This is because what we usually want mutations to be associated with is the branch in a tree, and the way we refer to these is by node, not by edge ID. In fact, edges mean different things across their spans, as the stuff below them moves around. Still, the node is easily obtainable from the edge.
Oh, but there's a problem: what about mutations over nodes for which there's no edge? To do those right, we could add ghost edges for such nodes (with parent = NULL) - but then, perhaps we should remove ancestral state from the site table entirely? Or, we could extend sites, so that there is one site for each root of the trees at that position of the genome (so, we'd remove the restriction that site positions be unique, and add a node attribute to sites)?
It would be kinda nice to not need to deal with the distinction between ancestral state and derived state, but this is starting to be a more wide-reaching change. For instance, it would mean a minor redefinition of what a "polarized" site statistic means: instead of summing over all non-ancestral states, it'd be summing over all non-root states. This is not a big deal, since it would only differ when the full tree isn't known, when people probably shouldn't compute site stats anyhow.
Oh, but there's a problem: what about mutations over nodes for which there's no edge? To do those right, we could add ghost edges for such nodes (with parent = NULL) - but then, perhaps we should remove ancestral state from the site table entirely? Or, we could extend sites, so that there is one site for each root of the trees at that position of the genome (so, we'd remove the restriction that site positions be unique, and add a node attribute to sites)?
Good point @petrelharp! Mutations over roots are a problem. Had we made this choice earlier on I'd imagine we'd have said "oh well, so you can't have mutations over roots, big deal". Or if you really do want to have mutations over "roots", then you have to stick in a unary edge over them. Having such mutations "floating in space" has always felt wrong to me, so forcing the explicit addition of an edge which the mutation must fall on doesn't seem so bad to me.
Removing the ancestral_state would be a change of another order entirely though, which I don't think I have the stomach for. I would break everything, top-to-bottom if we changed that now.
Can you think of any real example where we would have a mutation over a root @petrelharp? In any real example there would always be a branch for the mutation to happen on, I think - it's just contrived stuff in the tests that we'd have to worry about.
Well, currently anyhow, at the end of a SLiM run you have a lot of mutations over the roots - all the mutations that have fixed. In principle we'd only need the last one, and so it could go into ancestral state instead, I suppose... except, what if you have more than one root? (edit: not true, see next comment) Or, I guess - what if you simulate two populations, and initialize each to have different states, and they haven't totally coalesced? You could do this equivalently by making some fake parents and putting those mutations on the edges going back to them, sure.
Hm, or what if you've got a pedigree that doesn't go back forever, and you've imputed the genotypes of the oldest ancestors, but don't know how they are related? You could stick a made-up tree sequence on top to represent these ancestral sequences, but it seems nicer to be able to represent this information without having to make up extra things.
It feels to me that we need a way to say what the state is at the root of each tree, when there's more than one tree, else there's situations we can't represent - or, disallow multiple roots entirely. I don't like having to make up imaginary nodes to put those mutations on, so why not just allow edge.parent to be NULL?
ps. SLiM has a concept of "substitutions", so that as you run a simulation you have access to all the mutations that have fixed over the course of the simulation. When we restart a simulation from a tree sequence, SLiM recovers these by looking at the mutations that are over the roots of the trees. So, if we want to be able to completely recover the state of the simulation from the tree sequence (which I think we do!) then we need multiple mutations over the root, somehow. (and you need all of them, not just the last one, contrary to what I said above)
OK, thanks @petrelharp. That makes things a bit trickier, but we won't break SLiM. I initially don't like the idea of making the parent an edge potentially NULL, because we'll have to figure out every place we use edge.parent and put in a special case there. Things like our root-tracking method will be complicated and so on.
But maybe there's no that many places we use edge.parent. We'll need to have a look and see what breaks.
I had a look at the tree generation code, and having edges with parent == NULL isn't a trivial change. The way we keep track of roots and so on is quite involved, and I expect this will break. It's not impossible to do, but it would take a fair bit of thought and careful testing to make sure that all the corner cases are handled.
Coming back to the forward simulation case, is it really so bad to have edges on which the fixed mutations occur? I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals? Logically, I can't see a reason why we'd be interested in the fixed mutations but not in the edge that these mutations arose on.
I understand that this isn't the way that things are now, but would you consider moving to a position where they are? We can easily imagine extending simplify to keep the most recent node ancestral to all the mutations along a unary chain. This would be the key technical change for forward simulations, I guess. There would be some pain in upgrading files that already exist, but we are doing a major version bump on the file format so we can figure out an upgrade path that will maintain the same semantics as before for existing files.
I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals?
That's a good point - yes, they will be. And yes, if that change was made to simplify then we'd be all good. Good thinking.
For the other more hypothetical use case: suppose we've got genotypes of some ancestors and no idea where they came from: then ew'd just need to - say - make a dummy ancestor with a whole-genome edge going to every one of these, with all the mutations creating these ancestral sequences on them. Since we'd already presumably have to make up times for those mutations, this seems like no big deal.
It still seems cleaner to use NULL to say that "we don't know who/when this parent was", but given that's a hard thing, I'm OK with not being able to represent that missing information.
I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals?
That's a good point - yes, they will be. And yes, if that change was made to simplify then we'd be all good. Good thinking.
This isn't true for all fixations, though? If a mutation arises on a descendant of a time 0 node, and the simulation is run long enough (or the mutation sweeps, say), the fixation will be on a more recent node. Is that right?
If a mutation arises on a descendant of a time 0 node, and the simulation is run long enough (or the mutation sweeps, say), the fixation will be on a more recent node. Is that right?
The parent of the edge that the mutation is on will still be in the tree sequence - that's the main important thing here.
I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals?
That's a good point - yes, they will be. And yes, if that change was made to simplify then we'd be all good. Good thinking.
So if I understand correctly, the changes being discussed would not require any change in SLiM's behavior; is that correct? @petrelharp
Suppose you run a SLiM simulation past coalescence, and have mutations that are on the edges leading back to the first-generation ancestors retained by SLiM. If you save and load that TS into Python, and then simplify, what happens exactly? The first-gen ancestors get simplified away; where do the mutations on those edges end up, under the changes being proposed?
If a mutation arises on a descendant of a time 0 node, and the simulation is run long enough (or the mutation sweeps, say), the fixation will be on a more recent node. Is that right?
The parent of the edge that the mutation is on will still be in the tree sequence - that's the main important thing here.
Right, I was thrown by the phrase "all the fixed mutations".
Suppose you run a SLiM simulation past coalescence, and have mutations that are on the edges leading back to the first-generation ancestors retained by SLiM. If you save and load that TS into Python, and then simplify, what happens exactly? The first-gen ancestors get simplified away; where do the mutations on those edges end up, under the changes being proposed?
I think (and @petrelharp will correct me here) the difference is that the first-gen ancestors won't get simplified away any more @bhaller. Currently the mutations representing the fixations are sort of floating above the roots, but now we'd keep them on the branches leading back to the first generation.
OK, seems like there's rough agreement that this isn't a terrible idea, assuming that it's not too awful to implement. I'll have a go at the implementation side once the mutation time stuff has landed in #672 and report back.
Suppose you run a SLiM simulation past coalescence, and have mutations that are on the edges leading back to the first-generation ancestors retained by SLiM. If you save and load that TS into Python, and then simplify, what happens exactly? The first-gen ancestors get simplified away; where do the mutations on those edges end up, under the changes being proposed?
I think (and @petrelharp will correct me here) the difference is that the first-gen ancestors won't get simplified away any more @bhaller. Currently the mutations representing the fixations are sort of floating above the roots, but now we'd keep them on the branches leading back to the first generation.
Interesting. I guess the first-gen ancestors would still get simplified away unless there was a mutation on an edge leading to that first-gen ancestor, though, right? But if a mutation is present, it could be thought of as "retaining" the edge it is on, and the nodes at the ends of that edge, across simplification?
I'm pondering this because I'm wondering whether this change might break anybody's analysis code. I.e., code that loads a TS from SLiM, calls simplify() to strip away everything above coalescence, and then does analysis from there. That's a pretty common thing to do. But it would probably be unusual to do that style of analysis if one had fixed mutations in the tree. @petrelharp, thoughts?
Interesting. I guess the first-gen ancestors would still get simplified away unless there was a mutation on an edge leading to that first-gen ancestor, though, right? But if a mutation is present, it could be thought of as "retaining" the edge it is on, and the nodes at the ends of that edge, across simplification?
Exactly.
I think it's pretty unlikely we'd break anyone's analysis code, but we should definitely think the consequences through.
Uh-oh, there's a different, maybe worse wrinkle:
So if I understand correctly, the changes being discussed would not require any change in SLiM's behavior; is that correct?
Not the behavior, but it would actually make SLiM's job harder when recording tables, since it will need to know not only which genome the mutation is happening in, but who the ancestor was that individual inherited that bit of genome from. (ie the edge it got that genome along) That's actually a pretty significant change - currently, forward simulators have to do bookkeeping for genomes/individuals (which of course they do) but they don't have to do bookkeeping about edges, which is more of a tskit thing.
I guess to do this, SLiM would have to associate with each individual something like a list of edge IDs, so that when mutations happen it can go look up which edge the mutation is supposed to be on.
I guess to do this, SLiM would have to associate with each individual something like a list of edge IDs, so that when mutations happen it can go look up which edge the mutation is supposed to be on.
Yikes. That is indeed a big change, and would have performance implications.
Jerome originally wrote, above:
I can't think of any immediate applications of this beyond checking that mutation times are valid and viz, so I think it's perhaps it's one to acknowledge as a sub optimal design and move on.
It sounds to me like it's time to revisit whether doing this is worth the headaches.
but it would actually make SLiM's job harder when recording tables, since it will need to know not only which genome the mutation is happening in, but who the ancestor was that individual inherited that bit of genome from. (ie the edge it got that genome along)
Yeah, I'd just written about this to @jeromekelleher via email. It turns out that it is straightforward to write an efficient mapping from current mutation table indexes to the edge table. I have a prototype that I'd sent @jeromekelleher to run by him before bringing it up here. It uses the fwdpp
table layout, but is amenable to implementing in C, too, for tskit. I'll take a crack at that and get back to you all, hopefully next week, if that sounds good?
It turns out that it is straightforward to write an efficient mapping from current mutation table indexes to the edge table.
This sounds promising, but could you explain what you mean, exactly? Is this solving the problem of how to find the edge ID given the position and node? or something else?
This sounds promising, but could you explain what you mean, exactly? Is this solving the problem of how to find the edge ID given the position and node? or something else?
Yes, each row in the mutation table currently has a node
and refers to a site
, which has the position. Using those two values, you can get an array of len(tables.mutations)
that contains the indexes into tables.edges
using a bunch of binary searches and one sort.
Yes, each row in the mutation table currently has a
node
and refers to asite
, which has the position. Using those two values, you can get an array oflen(tables.mutations)
that contains the indexes intotables.edges
using a bunch of binary searches and one sort.
Ah, yes, and it's a modular step! Good point. So, perhaps a forward simulator could do something like this:
That'd add one more cleanup step; if this is a method in tskit and not too computationally costly, then that'd be fine.
That'd add one more cleanup step; if this is a method in tskit and not too computationally costly, then that'd be fine.
Well, it will certainly have a computational cost; by the sound of it, a binary search for each edge that needs to be looked up, plus a sort, as @molpopgen said, right? I'm just not sure why we're doing this – making things slower and more complicated for no clear benefit. I'm willing to be outvoted, but the motivation seems unclear at best, to me.
I'm with @bhaller on this one, but I also wasn't fully clear. My thought are:
node
in the mutation table.edge
as optional column, or forget it entirely.If one wants mutation -> edge maps, build them on demand, which is probably just once when the tree sequence is finalized? I think the use for this mapping is all/mostly post-simulation, so my vote is to definitely not drop node
.
From what I can see adding a method to return a mutation->edge map, which is calculated on demand, seems to fulfill the use cases. I can see the attraction of the conceptual "rightness" but not sure it's worth the disruption.
Ah, that makes sense. This might be like another index, for instance.
This might be like another index, for instance.
Right, so we'd need to decide where to hold it in tskit. Presumably in the table collection and/or the tree sequence, but I'd have to remind myself how these look and relate to each other at the C level.
It's looking increasingly like this isn't worth the trouble, but I think it was worthwhile going through the exercise of thinking through why storing the node
is a good idea. Thanks to all for taking the time to think this through. I think we can probably find the edge in log time using the current indexes, and I'm sure that's fine in practical terms.
Leave it with me and I'll see what can be done.
I've spent some time looking into the details of how this would work, and (a) it would be a horrible mess and (b) would make simplify significantly harder. The reason that simplify would be harder is that we'd need to keep track to the mapping between edges in the input ts and the output ts, and this is a non-trivial thing to do. Edges are chopped up and merged together in all sorts of ways during simplify, so having to keep track of which old edge a new edge ended up mapping to would be quite a lot of extra bookkeeping.
So: we made the right choice originally, hooray!
I'll open an issue to track the functionality for finding out which edge a give mutation occurs on (which would be useful), but it's not in any way urgent so we can punt it down the road.
Thanks again for the input everyone.
Currently a mutation knows what node it happened over, but really it should know what edge it happened on. Conceptually, rather than storing the
node
ID, we should be storing theedge
ID - then, you can look up both the child and parent node for a given mutation, as well as know the spatial extent of the edge.I can't think of any immediate applications of this beyond checking that mutation times are valid and viz, so I think it's perhaps it's one to acknowledge as a sub optimal design and move on. It would cause a fair bit of breakage if we were to change the Tables API so that the MutationTable.node became MutationTable.edge. The TreeSequence API could be updated in a non-breaking way, I'd imagine, as we'd fill out Mutation.node by looking up the edge table.
There's a few ways we could go about doing this, if we did it. Here's the "ripping off the band aid" way:
node
toedge
.edge
instead ofnode
.edge
attribute. Fill in the oldnode
item by looking up the edge table. Code that uses the mutation_t struct shouldn't be affected at all then.node
column as a property computed on demand, which should really minimise downstream breakage.I don't think there'd be that much downstream breakage if we change the tables API. Most things that use the Tables API are either within tskit, or closely related projects. Since we're already going through a major file version bump for adding the time field to mutations (#513), there's no additional pain caused there.