tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

add `individuals=` argument to simplify #464

Open petrelharp opened 4 years ago

petrelharp commented 4 years ago

An error that I have seen quite a few people make already is that they pass in a list of individual ids to samples=, instead of their constituent nodes. It's an easy mistake to make. If we had an "individuals" argument to simplify, this could help avoid this confusion. It might work like this:

for ind in individuals:
   nn = self.individual(ind).nodes
   warned = False
   for n in n:
     if (not warned) and n in samples:
        raise warnings.warn("Some individual nodes are already present in the 'samples' argument - did you mean to pass both individuals and their nodes?")
        warned = True
     samples.append(n)
jeromekelleher commented 4 years ago

Good idea. Why not make the individuals and samples arguments mutually exclusive though? Seems simpler, and I can't think of a good reason for mixing them.

petrelharp commented 4 years ago

Yeah, you're right. I guess the main question is whether to make samples optional with default None or make people pass in None explicitly (the first, I imagine?).

jeromekelleher commented 4 years ago

I think samples defaults to None already (in this case it'll take all the samples in the ts). So another individuals=None argument should work fine.

hyanwong commented 3 years ago

Here's the nuclear option. We could do the above and then also make all the arguments to simplify keyword-only, so users would have to say either samples= or individuals= in the call. It would break a lot of code (mostly our own, I think), so we'd have to have a release where we issued a deprecation warning first. But it would (hopefully) reduce the number of times users would make the error. Almost certainly too extreme a solution, but I thought it worth mentioning.

grahamgower commented 3 years ago

+1 for this option. Simplifying to n individuals by looking just at their nodes does not always result in n individuals after simplification.

Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A which contains a node that is an ancestor to individuals in A. As far as I can tell, simplifying to the nodes in the A subset of individuals also retains the ancestor individual who is not in A.

petrelharp commented 3 years ago

Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A which contains a node that is an ancestor to individuals in A. As far as I can tell, simplifying to the nodes in the A subset of individuals also retains the ancestor individual who is not in A.

That's true, and by design: what simplify does is to retain only information about the samples and the nodes necessary to describe their genealogies. "Individuals" are really just metadata about nodes, and we don't throw away the metadata of the ancestral nodes either.

But, I realize that this is a common error, to assume that samples and individuals are interchangeable. This is why in pyslim we've got individuals_alive_at( ), to easily pull out the individuals you want. I guess I think we should deal with that elsewhere, not in simplify? Options might be (a) remind people more about this common error (b) provide a tool to remove certain individuals so people can 'tidy' up their tree sequence (c) provide a tool to more easily extract the individuals they want (I think we have an issue for this: e.g., give me the individuals attached to these nodes).

grahamgower commented 3 years ago

I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?

petrelharp commented 3 years ago

I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?

I don't think so - for instance, you might argue by analogy that you should also have those individuals parents, now that we have those. I think the least confusing thing is just to make individuals= argument basically an alias for nodes=(the nodes in those individuals.

hyanwong commented 2 years ago

I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?

Perhaps what you want, @grahamgower, is a delete_individuals function. You call that first, then simplify(individuals=np.arange(ts.num_individuals)). This does, however, result in the nodes associated with all the individuals being marked as samples, which might not be what you want (there are several simulation cases where non-sample nodes are associated with individuals)?