tskit-dev / tskit

Population-scale genomics
MIT License

Add TableCollection.deduplicate_populations method #728

Open jeromekelleher opened 4 years ago

jeromekelleher commented 4 years ago

When we're pasting together different datasets, we'll sometimes end up with multiple copies of the same populations, say 1000G CEU. It would be useful to have a way to deduplicate these populations, so that we only have one copy of each and all the references within a table collection get updated to point to this copy.

I guess we can make the definition of equality fairly flexible, since we'll be doing this in Python (I'd imagine?), but a simple way to define population equality is to have identical encoded metadata.
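To make the "identical encoded metadata" definition concrete, here is a minimal plain-Python sketch. It is not an existing tskit method: the function name `deduplicate_by_metadata` and its return values are hypothetical, and it works on a plain list of metadata byte strings (one per population row) rather than on a real population table.

```python
def deduplicate_by_metadata(encoded_metadata):
    """Group populations whose encoded metadata bytes are identical.

    Hypothetical helper, not part of tskit. Returns (kept_rows, id_map):
    kept_rows lists the old IDs that survive (first occurrence of each
    distinct metadata value), and id_map[old_id] gives the new ID.
    """
    first_seen = {}  # metadata bytes -> new population ID
    kept_rows = []
    id_map = []
    for old_id, md in enumerate(encoded_metadata):
        if md not in first_seen:
            first_seen[md] = len(kept_rows)
            kept_rows.append(old_id)
        id_map.append(first_seen[md])
    return kept_rows, id_map


# Two copies of CEU (rows 0 and 2) collapse to a single population.
pops = [b'{"name": "CEU"}', b'{"name": "YRI"}', b'{"name": "CEU"}']
kept_rows, id_map = deduplicate_by_metadata(pops)
```

One consequence of this definition worth noting: populations with empty metadata all compare equal, so they would be merged too unless the caller excludes them.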

@petrelharp, @mufernando, does this have any bearing on what you've been working on recently?

jeromekelleher commented 4 years ago

(We could also consider doing the same for Individuals)

petrelharp commented 4 years ago

Sounds good! But no, doesn't affect our stuff, really.

petrelharp commented 4 years ago

On second thought, how about a remap_populations method instead, which takes a new population table and a num_populations-long list of population IDs mapping the old populations to the new ones?

This'd be more flexible, and relying on metadata being identical seems to break the "tskit doesn't use metadata" principle. It'd be pretty easy for the end user to make the appropriate arguments for your use case.
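A rough sketch of the remapping step this would involve, in plain Python. The helper name `remap_population_refs` is hypothetical, and the column is a plain list standing in for something like the node table's population column; the only tskit convention assumed is that -1 denotes a null ID.

```python
NULL = -1  # tskit uses -1 for a null ID


def remap_population_refs(refs, id_map):
    """Rewrite a column of population references through id_map.

    Hypothetical helper, not part of tskit. id_map has one entry per
    old population, giving its new ID. Null references are left alone.
    """
    remapped = []
    for ref in refs:
        if ref == NULL:
            remapped.append(NULL)
        elif 0 <= ref < len(id_map):
            remapped.append(id_map[ref])
        else:
            raise ValueError(f"population reference {ref} out of bounds")
    return remapped


# Old populations 0 and 2 both map to new population 0.
node_population = [0, 2, 1, -1, 2]
id_map = [0, 1, 0]
new_node_population = remap_population_refs(node_population, id_map)
```

The same mapping would need to be applied to every column that references populations, not only the node table.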

mufernando commented 4 years ago

@jeromekelleher, how are you pasting the tree sequences together?

I think, following @petrelharp, you could "glue" tree sequences together using TableCollection.union, but add an additional parameter, remap_populations, that deals with equivalence of populations. Right now this is handled by a simpler parameter, add_population, which allows for either complete or no equivalence between the populations of the two table collections. Actually, @petrelharp gave me this idea some time ago, but at the time we decided to go with something simpler.

jeromekelleher commented 4 years ago

We're doing the pasting together before they become tree sequences @mufernando - in tsinfer, we're slicing and dicing various input files and smushing them together in various ways, and we figured this would be a good way to resolve any duplicated population entries that we end up with in the output tree sequences.

hyanwong commented 4 years ago

> relying on metadata being identical seems to break the "tskit doesn't use metadata" principle.

I think this would be an explicit exception. And anyway, it wouldn't need to understand the metadata or schema - it would just check for simple identity. Possibly the function name could make it clear: deduplicate_populations_via_metadata (yuck!)

> It'd be pretty easy for the end user to make the appropriate arguments for your use case.

I'm not sure that's true. It's only easy if you know beforehand which samples came from equivalent populations, and you might not. Merging on metadata seems fine to me, and it would be useful for a user to be given this functionality rather than have to re-invent it and have to understand how metadata actually works (which I found to be rather a steep learning curve).

benjeffery commented 4 years ago

This would be a Python-API-only method, correct? Then you could use the decoded metadata for equality.
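As an illustration of why decoded equality might differ from encoded equality, here is a small sketch that assumes the metadata is JSON-encoded (an assumption, since other codecs are possible). The helper `canonical_key` is hypothetical: it decodes the bytes and re-serialises them with sorted keys, so that two encodings of the same content compare equal.

```python
import json


def canonical_key(encoded):
    """Decode JSON metadata bytes and return a canonical string form.

    Hypothetical helper, assuming JSON-encoded, non-empty metadata.
    Key order and whitespace in the original bytes no longer matter.
    """
    decoded = json.loads(encoded.decode("utf-8"))
    return json.dumps(decoded, sort_keys=True, separators=(",", ":"))


a = b'{"name": "CEU", "source": "1000G"}'
b = b'{"source":"1000G","name":"CEU"}'

bytes_equal = a == b  # encoded comparison
decoded_equal = canonical_key(a) == canonical_key(b)  # decoded comparison
```

Here the two byte strings differ but decode to the same content, so only the decoded comparison treats them as the same population.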