tskit-dev / tskit

Population-scale genomics
MIT License
157 stars 73 forks source link

Change MutationTable.node to MutationTable.edge? #668

Closed jeromekelleher closed 4 years ago

jeromekelleher commented 4 years ago

Currently a mutation knows what node it happened over, but really it should know what edge it happened on. Conceptually, rather than storing the node ID, we should be storing the edge ID - then, you can look up both the child and parent node for a given mutation, as well as know the spatial extent of the edge.

I can't think of any immediate applications of this beyond checking that mutation times are valid and viz, so I think it's perhaps it's one to acknowledge as a sub optimal design and move on. It would cause a fair bit of breakage if we were to change the Tables API so that the MutationTable.node became MutationTable.edge. The TreeSequence API could be updated in a non-breaking way, I'd imagine, as we'd fill out Mutation.node by looking up the edge table.

There's a few ways we could go about doing this, if we did it. Here's the "ripping off the band aid" way:

I don't think there'd be that much downstream breakage if we change the tables API. Most things that use the Tables API are either within tskit, or closely related projects. Since we're already going through a major file version bump for adding the time field to mutations (#513), there's no additional pain caused there.

bhaller commented 4 years ago

As I wrote on Slack: For what it’s worth, a quick look through SLiM’s code indicates that adapting to this change would have essentially no impact. We do not use that column, we just populate it and copy it in a couple places. Maybe I missed a spot or two, but anyway, looks trivial for us. @petrelharp

petrelharp commented 4 years ago

Hm, let's see, thinking this through out loud... Currently, we store the most recent node that has inherited the mutation, so that anyone inheriting from that node also inherits that mutation (unless other mutations intervene). That does seem like a good choice. We also know where the mutation is, thanks to the site. Changing it to the edge would add in the information of the most recent node that had the haplotype that the mutation occurred on. This is strictly more information, since the edge contains the (child) node. Changing this to edge would make associating mutations with edges an O(1) lookup, which would be nice. So, it sounds like a good idea, although I can't think of any applications other than error checking, as you say, once we have the times. This is because what we usually want mutations to be associated with is the branch in a tree, and the way we refer to these is by node, not by edge ID. In fact, edges mean different things across their spans, as the stuff below them moves around. Still, the node is easily obtainable from the edge.

Oh, but there's a problem: what about mutations over nodes for which there's no edge? To do those right, we could add ghost edges for such nodes (with parent = NULL) - but then, perhaps we should remove ancestral state from the site table entirely? Or, we could extend sites, so that there is one site for each root of the trees at that position of the genome (so, we'd remove the restriction that site positions be unique, and add a node attribute to sites)?

It would be kinda nice to not need to deal with the distinction between ancestral state and derived state, but this is starting to be a more wide-reaching change. For instance, it would mean a minor redefinition of what a "polarized" site statistic means: instead of summing over all non-ancestral states, it'd be summing over all non-root states. This is not a big deal, since it would only differ when the full tree isn't known, when people probably shouldn't compute site stats anyhow.

jeromekelleher commented 4 years ago

Oh, but there's a problem: what about mutations over nodes for which there's no edge? To do those right, we could add ghost edges for such nodes (with parent = NULL) - but then, perhaps we should remove ancestral state from the site table entirely? Or, we could extend sites, so that there is one site for each root of the trees at that position of the genome (so, we'd remove the restriction that site positions be unique, and add a node attribute to sites)?

Good point @petrelharp! Mutations over roots are a problem. Had we made this choice earlier on I'd imagine we'd have said "oh well, so you can't have mutations over roots, big deal". Or if you really do want to have mutations over "roots", then you have to stick in a unary edge over them. Having such mutations "floating in space" has always felt wrong to me, so forcing the explicit addition of an edge which the mutation must fall on doesn't seem so bad to me.

Removing the ancestral_state would be a change of another order entirely though, which I don't think I have the stomach for. I would break everything, top-to-bottom if we changed that now.

jeromekelleher commented 4 years ago

Can you think of any real example where we would have a mutation over a root @petrelharp? In any real example there would always be a branch for the mutation to happen on, I think - it's just contrived stuff in the tests that we'd have to worry about.

petrelharp commented 4 years ago

Well, currently anyhow, at the end of a SLiM run you have a lot of mutations over the roots - all the mutations that have fixed. In principle we'd only need the last one, and so it could go into ancestral state instead, I suppose... except, what if you have more than one root? (edit: not true, see next comment) Or, I guess - what if you simulate two populations, and initialize each to have different states, and they haven't totally coalesced? You could do this equivalently by making some fake parents and putting those mutations on the edges going back to them, sure.

Hm, or what if you've got a pedigree that doesn't go back forever, and you've imputed the genotypes of the oldest ancestors, but don't know how they are related? You could stick a made-up tree sequence on top to represent these ancestral sequences, but it seems nicer to be able to represent this information without having to make up extra things.

It feels to me that we need a way to say what the state is at the root of each tree, when there's more than one tree, else there's situations we can't represent - or, disallow multiple roots entirely. I don't like having to make up imaginary nodes to put those mutations on, so why not just allow edge.parent to be NULL?

petrelharp commented 4 years ago

ps. SLiM has a concept of "substitutions", so that as you run a simulation you have access to all the mutations that have fixed over the course of the simulation. When we restart a simulation from a tree sequence, SLiM recovers these by looking at the mutations that are over the roots of the trees. So, if we want to be able to completely recover the state of the simulation from the tree sequence (which I think we do!) then we need multiple mutations over the root, somehow. (and you need all of them, not just the last one, contrary to what I said above)

jeromekelleher commented 4 years ago

OK, thanks @petrelharp. That makes things a bit trickier, but we won't break SLiM. I initially don't like the idea of making the parent an edge potentially NULL, because we'll have to figure out every place we use edge.parent and put in a special case there. Things like our root-tracking method will be complicated and so on.

But maybe there's no that many places we use edge.parent. We'll need to have a look and see what breaks.

jeromekelleher commented 4 years ago

I had a look at the tree generation code, and having edges with parent == NULL isn't a trivial change. The way we keep track of roots and so on is quite involved, and I expect this will break. It's not impossible to do, but it would take a fair bit of thought and careful testing to make sure that all the corner cases are handled.

Coming back to the forward simulation case, is it really so bad to have edges on which the fixed mutations occur? I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals? Logically, I can't see a reason why we'd be interested in the fixed mutations but not in the edge that these mutations arose on.

I understand that this isn't the way that things are now, but would you consider moving to a position where they are? We can easily imagine extending simplify to keep the most recent node ancestral to all the mutations along a unary chain. This would be the key technical change for forward simulations, I guess. There would be some pain in upgrading files that already exist, but we are doing a major version bump on the file format so we can figure out an upgrade path that will maintain the same semantics as before for existing files.

petrelharp commented 4 years ago

I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals?

That's a good point - yes, they will be. And yes, if that change was made to simplify then we'd be all good. Good thinking.

For the other more hypothetical use case: suppose we've got genotypes of some ancestors and no idea where they came from: then ew'd just need to - say - make a dummy ancestor with a whole-genome edge going to every one of these, with all the mutations creating these ancestral sequences on them. Since we'd already presumably have to make up times for those mutations, this seems like no big deal.

It still seems cleaner to use NULL to say that "we don't know who/when this parent was", but given that's a hard thing, I'm OK with not being able to represent that missing information.

molpopgen commented 4 years ago

I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals?

That's a good point - yes, they will be. And yes, if that change was made to simplify then we'd be all good. Good thinking.

This isn't true for all fixations, though? If a mutation arises on a descendant of a time 0 node, and the simulation is run long enough (or the mutation sweeps, say), the fixation will be on a more recent node. Is that right?

petrelharp commented 4 years ago

If a mutation arises on a descendant of a time 0 node, and the simulation is run long enough (or the mutation sweeps, say), the fixation will be on a more recent node. Is that right?

The parent of the edge that the mutation is on will still be in the tree sequence - that's the main important thing here.

bhaller commented 4 years ago

I seem to remember that a lot of the time we want to have the time=0 population in there as "remembered individuals" anyway. If this is the case, won't all the fixed mutations be on the edges going back to these ancient individuals?

That's a good point - yes, they will be. And yes, if that change was made to simplify then we'd be all good. Good thinking.

So if I understand correctly, the changes being discussed would not require any change in SLiM's behavior; is that correct? @petrelharp

Suppose you run a SLiM simulation past coalescence, and have mutations that are on the edges leading back to the first-generation ancestors retained by SLiM. If you save and load that TS into Python, and then simplify, what happens exactly? The first-gen ancestors get simplified away; where do the mutations on those edges end up, under the changes being proposed?

molpopgen commented 4 years ago

If a mutation arises on a descendant of a time 0 node, and the simulation is run long enough (or the mutation sweeps, say), the fixation will be on a more recent node. Is that right?

The parent of the edge that the mutation is on will still be in the tree sequence - that's the main important thing here.

Right, I was thrown by the phrase "all the fixed mutations".

jeromekelleher commented 4 years ago

Suppose you run a SLiM simulation past coalescence, and have mutations that are on the edges leading back to the first-generation ancestors retained by SLiM. If you save and load that TS into Python, and then simplify, what happens exactly? The first-gen ancestors get simplified away; where do the mutations on those edges end up, under the changes being proposed?

I think (and @petrelharp will correct me here) the difference is that the first-gen ancestors won't get simplified away any more @bhaller. Currently the mutations representing the fixations are sort of floating above the roots, but now we'd keep them on the branches leading back to the first generation.

jeromekelleher commented 4 years ago

OK, seems like there's rough agreement that this isn't a terrible idea, assuming that it's not too awful to implement. I'll have a go at the implementation side once the mutation time stuff has landed in #672 and report back.

bhaller commented 4 years ago

Suppose you run a SLiM simulation past coalescence, and have mutations that are on the edges leading back to the first-generation ancestors retained by SLiM. If you save and load that TS into Python, and then simplify, what happens exactly? The first-gen ancestors get simplified away; where do the mutations on those edges end up, under the changes being proposed?

I think (and @petrelharp will correct me here) the difference is that the first-gen ancestors won't get simplified away any more @bhaller. Currently the mutations representing the fixations are sort of floating above the roots, but now we'd keep them on the branches leading back to the first generation.

Interesting. I guess the first-gen ancestors would still get simplified away unless there was a mutation on an edge leading to that first-gen ancestor, though, right? But if a mutation is present, it could be thought of as "retaining" the edge it is on, and the nodes at the ends of that edge, across simplification?

I'm pondering this because I'm wondering whether this change might break anybody's analysis code. I.e., code that loads a TS from SLiM, calls simplify() to strip away everything above coalescence, and then does analysis from there. That's a pretty common thing to do. But it would probably be unusual to do that style of analysis if one had fixed mutations in the tree. @petrelharp, thoughts?

jeromekelleher commented 4 years ago

Interesting. I guess the first-gen ancestors would still get simplified away unless there was a mutation on an edge leading to that first-gen ancestor, though, right? But if a mutation is present, it could be thought of as "retaining" the edge it is on, and the nodes at the ends of that edge, across simplification?

Exactly.

I think it's pretty unlikely we'd break anyone's analysis code, but we should definitely think the consequences through.

petrelharp commented 4 years ago

Uh-oh, there's a different, maybe worse wrinkle:

 So if I understand correctly, the changes being discussed would not require any change in SLiM's behavior; is that correct?

Not the behavior, but it would actually make SLiM's job harder when recording tables, since it will need to know not only which genome the mutation is happening in, but who the ancestor was that individual inherited that bit of genome from. (ie the edge it got that genome along) That's actually a pretty significant change - currently, forward simulators have to do bookkeeping for genomes/individuals (which of course they do) but they don't have to do bookkeeping about edges, which is more of a tskit thing.

I guess to do this, SLiM would have to associate with each individual something like a list of edge IDs, so that when mutations happen it can go look up which edge the mutation is supposed to be on.

bhaller commented 4 years ago

I guess to do this, SLiM would have to associate with each individual something like a list of edge IDs, so that when mutations happen it can go look up which edge the mutation is supposed to be on.

Yikes. That is indeed a big change, and would have performance implications.

Jerome originally wrote, above:

I can't think of any immediate applications of this beyond checking that mutation times are valid and viz, so I think it's perhaps it's one to acknowledge as a sub optimal design and move on.

It sounds to me like it's time to revisit whether doing this is worth the headaches.

molpopgen commented 4 years ago

but it would actually make SLiM's job harder when recording tables, since it will need to know not only which genome the mutation is happening in, but who the ancestor was that individual inherited that bit of genome from. (ie the edge it got that genome along)

Yeah, I'd just written about this to @jeromekelleher via email. It turns out that it is straightforward to write an efficient mapping from current mutation table indexes to the edge table. I have a prototype that I'd sent @jeromekelleher to run by him before bringing it up here. It uses the fwdpp table layout, but is amenable to implementing in C, too, for tskit. I'll take a crack at that and get back to you all, hopefully next week, if that sounds good?

petrelharp commented 4 years ago

It turns out that it is straightforward to write an efficient mapping from current mutation table indexes to the edge table.

This sounds promising, but could you explain what you mean, exactly? Is this solving the problem of how to find the edge ID given the position and node? or something else?

molpopgen commented 4 years ago

This sounds promising, but could you explain what you mean, exactly? Is this solving the problem of how to find the edge ID given the position and node? or something else?

Yes, each row in the mutation table currently has a node and refers to a site, which has the position. Using those two values, you can get an array of len(tables.mutations) that contains the indexes into tables.edges using a bunch of binary searches and one sort.

petrelharp commented 4 years ago

Yes, each row in the mutation table currently has a node and refers to a site, which has the position. Using those two values, you can get an array of len(tables.mutations) that contains the indexes into tables.edges using a bunch of binary searches and one sort.

Ah, yes, and it's a modular step! Good point. So, perhaps a forward simulator could do something like this:

  1. store the node ID in the "edge" column of the mutation table (or somewhere else)
  2. before simplification, when we do other cleanup (eg deduplicating sites) swap these out for the actual edge IDs

That'd add one more cleanup step; if this is a method in tskit and not too computationally costly, then that'd be fine.

bhaller commented 4 years ago

That'd add one more cleanup step; if this is a method in tskit and not too computationally costly, then that'd be fine.

Well, it will certainly have a computational cost; by the sound of it, a binary search for each edge that needs to be looked up, plus a sort, as @molpopgen said, right? I'm just not sure why we're doing this – making things slower and more complicated for no clear benefit. I'm willing to be outvoted, but the motivation seems unclear at best, to me.

molpopgen commented 4 years ago

I'm with @bhaller on this one, but I also wasn't fully clear. My thought are:

If one wants mutation -> edge maps, build them on demand, which is probably just once when the tree sequence is finalized? I think the use for this mapping is all/mostly post-simulation, so my vote is to definitely not drop node.

benjeffery commented 4 years ago

From what I can see adding a method to return a mutation->edge map, which is calculated on demand, seems to fulfill the use cases. I can see the attraction of the conceptual "rightness" but not sure it's worth the disruption.

petrelharp commented 4 years ago

Ah, that makes sense. This might be like another index, for instance.

molpopgen commented 4 years ago

This might be like another index, for instance.

Right, so we'd need to decide where to hold it in tskit. Presumably in the table collection and/or the tree sequence, but I'd have to remind myself how these look and relate to each other at the C level.

jeromekelleher commented 4 years ago

It's looking increasingly like this isn't worth the trouble, but I think it was worthwhile going through the exercise of thinking through why storing the node is a good idea. Thanks to all for taking the time to think this through. I think we can probably find the edge in log time using the current indexes, and I'm sure that's fine in practical terms.

Leave it with me and I'll see what can be done.

jeromekelleher commented 4 years ago

I've spent some time looking into the details of how this would work, and (a) it would be a horrible mess and (b) would make simplify significantly harder. The reason that simplify would be harder is that we'd need to keep track to the mapping between edges in the input ts and the output ts, and this is a non-trivial thing to do. Edges are chopped up and merged together in all sorts of ways during simplify, so having to keep track of which old edge a new edge ended up mapping to would be quite a lot of extra bookkeeping.

So: we made the right choice originally, hooray!

I'll open an issue to track the functionality for finding out which edge a give mutation occurs on (which would be useful), but it's not in any way urgent so we can punt it down the road.

Thanks again for the input everyone.