jeromekelleher opened 4 years ago
(We could also consider doing the same for Individuals)
Sounds good! But no, doesn't affect our stuff, really.
On second thought, how about instead `remap_populations`, which takes a new population table and a num_populations-long list of pop ids mapping the old pops to the new ones? This'd be more flexible, and relying on metadata being identical seems to break the "tskit doesn't use metadata" principle. It'd be pretty easy for the end user to make the appropriate arguments for your use case.
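For concreteness, here's a rough sketch of what that could look like (a hypothetical function - the name, signature, and schema handling are all assumptions, not an existing tskit API):

```python
import numpy as np
import tskit


def remap_populations(tables, new_populations, population_map):
    """
    Sketch: replace tables.populations with new_populations and
    rewrite all population references, where population_map[j] is the
    id in the new table that old population j maps to.
    """
    # Trailing slot so that NULL (-1) references stay NULL.
    pop_map = np.append(
        np.asarray(population_map, dtype=np.int32), np.int32(tskit.NULL)
    )
    # Rewrite every column that holds population ids.
    tables.nodes.population = pop_map[tables.nodes.population]
    tables.migrations.source = pop_map[tables.migrations.source]
    tables.migrations.dest = pop_map[tables.migrations.dest]
    # Swap in the new table (assumes compatible metadata schemas).
    tables.populations.clear()
    tables.populations.metadata_schema = new_populations.metadata_schema
    for pop in new_populations:
        tables.populations.append(pop)
```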
@jeromekelleher, how are you pasting the tree sequences together?
I think, following @petrelharp, you could "glue" tree sequences together using `TableCollection.union`, but add an additional parameter, `remap_populations`, that deals with equivalency of populations. Right now this is dealt with by a simpler parameter, `add_populations`, which allows for either complete or no equivalency between the populations of the two table collections. Actually, @petrelharp gave me this idea some time ago, but at the time we decided to go with something simpler.
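For reference, this is roughly how `union` behaves today with `add_populations` (a minimal, self-contained example; the two table collections here are toy ones built just for illustration):

```python
import numpy as np
import tskit


def make_tables(pop_name):
    # A minimal table collection: one sample node in one population.
    tables = tskit.TableCollection(sequence_length=1.0)
    pop = tables.populations.add_row(metadata=pop_name.encode())
    tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0, population=pop)
    return tables


tables1 = make_tables("CEU")
tables2 = make_tables("CEU")  # duplicates the same population

# node_mapping[i] is the node in tables1 equivalent to node i in
# tables2, or NULL if there is none; here nothing is shared.
node_mapping = np.full(tables2.nodes.num_rows, tskit.NULL, dtype=np.int32)

# add_populations=True treats the population tables as disjoint, so
# the node arriving from tables2 gets a newly added population and we
# end up with two identical "CEU" rows; add_populations=False would
# instead keep the incoming node's population id, reusing tables1's
# populations.
tables1.union(tables2, node_mapping, add_populations=True)
assert tables1.populations.num_rows == 2
```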
We're doing the pasting together before they become tree sequences, @mufernando - in tsinfer, we're slicing and dicing various input files and smushing them together in various ways, and we figured this would be a good way to resolve any duplicated population entries that we end up with in the output tree sequences.
> relying on metadata being identical seems to break the "tskit doesn't use metadata" principle.
I think this would be an explicit exception. And anyway, it wouldn't need to understand the metadata or the schema - it would just check for simple byte identity. Possibly the function name could make it clear: `deduplicate_populations_via_metadata` (yuck!)
> It'd be pretty easy for the end user to make the appropriate arguments for your use case.
I'm not sure that's true - only if you know beforehand which samples came from equivalent populations, and you might not. Merging on metadata seems fine to me, and it would be useful for a user to be given this functionality rather than having to re-invent it and understand how metadata actually works (which I found to be rather a steep learning curve).
This would be a Python-API-only method, correct? Then you could use the decoded metadata for equality.
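For example (assuming a metadata schema is set on the populations table of some table collection `tables`, so rows decode automatically):

```python
# With a schema, row.metadata is the decoded object (e.g. a dict),
# so equality can be defined on decoded values rather than raw bytes.
same = tables.populations[0].metadata == tables.populations[1].metadata
```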
When we're pasting together different datasets, we'll sometimes end up with multiple copies of the same populations, say 1000G CEU. It would be useful to have a way to deduplicate these populations, so that we only have one copy of each and all the references within a table collection get updated to point to this copy.
I guess we can make the definition of equality fairly flexible, since we'll be doing this in Python (I'd imagine?), but a simple way to define population equality is to have identical encoded metadata.
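Something along these lines, perhaps (a sketch only - the name and exact semantics are placeholders, and equality here is identity of the raw encoded metadata bytes):

```python
import numpy as np
import tskit


def deduplicate_populations(tables):
    # Raw (encoded) metadata for each population row.
    raw = tskit.unpack_bytes(
        tables.populations.metadata, tables.populations.metadata_offset
    )
    # Map each distinct metadata blob to the first id that carries it;
    # the extra trailing slot keeps NULL (-1) references as NULL.
    n = tables.populations.num_rows
    old_to_new = np.full(n + 1, tskit.NULL, dtype=np.int32)
    seen = {}
    kept = []
    for j, md in enumerate(raw):
        if md not in seen:
            seen[md] = len(kept)
            kept.append(md)
        old_to_new[j] = seen[md]

    # Update everything that references population ids.
    tables.nodes.population = old_to_new[tables.nodes.population]
    tables.migrations.source = old_to_new[tables.migrations.source]
    tables.migrations.dest = old_to_new[tables.migrations.dest]

    # Population rows carry only metadata, so truncating and repacking
    # leaves exactly one copy of each distinct population.
    tables.populations.truncate(len(kept))
    tables.populations.packset_metadata(kept)
```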
@petrelharp, @mufernando, does this have any bearing on what you've been working on recently?