tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Return table ID mappings from TC.delete_sites and TC.delete_intervals #1474

Open hyanwong opened 3 years ago

hyanwong commented 3 years ago

It is sometimes helpful to know how the IDs of sites, mutations, nodes, etc have changed after some large manipulation like delete_intervals. We already return a node mapping from TableCollection.simplify(): it would be good to do the same for TableCollection.delete_intervals.

hyanwong commented 3 years ago

I suggest that a few TableCollection functions, which act to modify the tables in-place, should return a standard "TableIdMaps" object, with properties sites, mutations, individuals, populations, migrations, and nodes - but not edges. If any of these are None, it indicates a 1:1 mapping. The justification for omitting edges is that the edge ID is rarely used, and rather impermanent (e.g. an edge can be split or merged, losing its identity).

In the long term, we might want to return a TableIdMaps object from simplify too, but we can cross that bridge when we come to it.

petrelharp commented 3 years ago

This could save some headaches, for sure. I keep wondering if there's a simpler way to do this - for instance, data frame manipulation functions in R don't do this for you, so the standard thing to do is to add a new column that's the ID before doing the rearrangment. One could do this with metadata, but not without messing up existing metadata. Sites have a unique key already: their position; but other things don't. So, if you've got a good use case, then I agree, this is probably worth the fuss.

I think including edges as well is a good idea - it doesn't hurt and might be useful - for instance, a method that split edges (#1469) might use this to tell you which old edges corresponded to which new edges. Hm - but, I guess this raises the question of whether this is an old->new map or new->old map, since split edges would need a new->old map, but something that merges edges, if it returned this information, would need it the other way. Which it is could depend on the method?

hyanwong commented 3 years ago

I'm still not convinced about including edges - it's the biggest of the tables, and the IDs are not usually meaningful for most users. We could leave it off for the time being but note that if there is sufficient demands we could consider including it.

Either way, probably something to discuss when Jerome is back in action.

jeromekelleher commented 3 years ago

I think excluding edges would be a premature optimisation - relative to whatever we've already done, and extra array of |edge table| would be negligible.

hyanwong commented 3 years ago

I belated realised that I don't actually need this for the drawing bug #1477 , as all I need in that case is the offsets, which are easily calculated (#1478 ). So there's now no urgency for this functionality from me, but I think it would be generally useful anyway, and probably worth leaving this issue open - perhaps deferred to a later tskit release?

jeromekelleher commented 3 years ago

OK, SGTM. I'll leave it to @benjeffery to do the appropriate issue classification.

benjeffery commented 3 years ago

I've added this to "upcoming" for now. It seems a good idea to survey all the table modification methods to see how this proposal would work - for example to check all the methods track the ids as there is no way under the current proposal for a method to communicate that it re-ordered but that it doesn't know the mapping.

@petrelharp's suggesting of using the metadata is interesting - I guess you would save the old metadata, replace with a column of old ids, then write back the re-rodered metadata after. This method could wrap any table modification method that copied rows.