Open petrelharp opened 4 years ago
Good idea. Why not make the individuals
and samples
arguments mutually exclusive though? Seems simpler, and I can't think of a good reason for mixing them.
Yeah, you're right. I guess the main question is whether to make samples
optional with default None
or make people pass in None
explicitly (the first, I imagine?).
I think samples
defaults to None already (in this case it'll take all the samples in the ts). So another individuals=None
argument should work fine.
Here's the nuclear option. We could do the above and then also make all the arguments to simplify
keyword-only, so users would have to say either samples=
or individuals=
in the call. It would break a lot of code (mostly our own, I think), so we'd have to have a release where we issued a deprecation warning first. But it would (hopefully) reduce the number of times users would make the error. Almost certainly too extreme a solution, but I thought it worth mentioning.
+1 for this option. Simplifying to n
individuals by looking just at their nodes does not always result in n
individuals after simplification.
Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A
) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A
which contains a node that is an ancestor to individuals in A
. As far as I can tell, simplifying to the nodes in the A
subset of individuals also retains the ancestor individual who is not in A
.
Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A which contains a node that is an ancestor to individuals in A. As far as I can tell, simplifying to the nodes in the A subset of individuals also retains the ancestor individual who is not in A.
That's true, and by design: what simplify does is to retain only information about the samples
and the nodes necessary to describe their genealogies. "Individuals" are really just metadata about nodes, and we don't throw away the metadata of the ancestral nodes either.
But, I realize that this is a common error, to assume that samples and individuals are interchangeable. This is why in pyslim we've got individuals_alive_at( )
, to easily pull out the individuals you want. I guess I think we should deal with that elsewhere, not in simplify
? Options might be (a) remind people more about this common error (b) provide a tool to remove certain individuals so people can 'tidy' up their tree sequence (c) provide a tool to more easily extract the individuals they want (I think we have an issue for this: e.g., give me the individuals attached to these nodes).
I was hoping that an individuals=
option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?
I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?
I don't think so - for instance, you might argue by analogy that you should also have those individuals parents, now that we have those. I think the least confusing thing is just to make individuals=
argument basically an alias for nodes=(the nodes in those individuals
.
I was hoping that an
individuals=
option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?
Perhaps what you want, @grahamgower, is a delete_individuals
function. You call that first, then simplify(individuals=np.arange(ts.num_individuals))
. This does, however, result in the nodes associated with all the individuals being marked as samples, which might not be what you want (there are several simulation cases where non-sample nodes are associated with individuals)?
An error that I have seen quite a few people make already is that they pass in a list of individual ids to
samples=
, instead of their constituent nodes. It's an easy mistake to make. If we had an "individuals" argument to simplify, this could help avoid this confusion. It might work like this: