Open hyanwong opened 3 years ago
I suggest that a few TableCollection functions, which act to modify the tables in-place, should return a standard "TableIdMaps" object, with properties sites
, mutations
, individuals
, populations
, migrations
, and nodes
- but not edges
. If any of these are None, it indicates a 1:1 mapping. The justification for omitting edges is that the edge ID is rarely used, and rather impermanent (e.g. an edge can be split or merged, losing its identity).
In the long term, we might want to return a TableIdMaps
object from simplify
too, but we can cross that bridge when we come to it.
This could save some headaches, for sure. I keep wondering if there's a simpler way to do this - for instance, data frame manipulation functions in R don't do this for you, so the standard thing to do is to add a new column that's the ID before doing the rearrangment. One could do this with metadata, but not without messing up existing metadata. Sites have a unique key already: their position; but other things don't. So, if you've got a good use case, then I agree, this is probably worth the fuss.
I think including edges
as well is a good idea - it doesn't hurt and might be useful - for instance, a method that split edges (#1469) might use this to tell you which old edges corresponded to which new edges. Hm - but, I guess this raises the question of whether this is an old->new map or new->old map, since split edges would need a new->old map, but something that merges edges, if it returned this information, would need it the other way. Which it is could depend on the method?
I'm still not convinced about including edges
- it's the biggest of the tables, and the IDs are not usually meaningful for most users. We could leave it off for the time being but note that if there is sufficient demands we could consider including it.
Either way, probably something to discuss when Jerome is back in action.
I think excluding edges would be a premature optimisation - relative to whatever we've already done, and extra array of |edge table| would be negligible.
I belated realised that I don't actually need this for the drawing bug #1477 , as all I need in that case is the offsets, which are easily calculated (#1478 ). So there's now no urgency for this functionality from me, but I think it would be generally useful anyway, and probably worth leaving this issue open - perhaps deferred to a later tskit release?
OK, SGTM. I'll leave it to @benjeffery to do the appropriate issue classification.
I've added this to "upcoming" for now. It seems a good idea to survey all the table modification methods to see how this proposal would work - for example to check all the methods track the ids as there is no way under the current proposal for a method to communicate that it re-ordered but that it doesn't know the mapping.
@petrelharp's suggesting of using the metadata is interesting - I guess you would save the old metadata, replace with a column of old ids, then write back the re-rodered metadata after. This method could wrap any table modification method that copied rows.
It is sometimes helpful to know how the IDs of sites, mutations, nodes, etc have changed after some large manipulation like delete_intervals. We already return a node mapping from
TableCollection.simplify()
: it would be good to do the same forTableCollection.delete_intervals
.