petrelharp opened 5 years ago
It's debatable whether we want (a) at all - @hyanwong and @awohns, any thoughts here?
@awohns will have the strongest opinion here, but I think that it should be possible to do (a) easily. I think there are reasonable use-cases, and I think Gil agrees that it may be a useful stat.
Yes, I agree with @hyanwong that there are cases where you want to see whether the node is actually an ancestor vs. how it is more distantly related to labeled nodes (as in the standard GNN statistic). In the ancient-DNA-tree-sequence-inference world (for instance), this would distinguish between ancient samples that are inferred to be direct ancestors of modern samples vs. ancient samples that are relatives of the ancestors of modern samples.
Great, sounds like we all agree with @petrelharp then. Adding this to the 0.2.0 release jobs.
As discussed in #328, moving this out to 0.2.1 (with some minor changes for forward compat)
Here's a possible implementation (from https://github.com/petrelharp/treestats_ms/issues/43 - the same statistic?):
import numpy as np

def genomic_descent(ts, sample_sets):
    n = np.array([len(x) for x in sample_sets])
    def f(x):
        # proportion of each sample set inheriting from the node, plus an
        # indicator of whether *any* sample inherits from it
        out = x / n
        out = np.append(out, 1.0 * (np.sum(x) > 0))
        return out
    A = ts.sample_count_stat(sample_sets, f, len(sample_sets) + 1,
                             polarised=True, mode='node', strict=False)
    D = A[:, :-1]
    # normalise by the span over which the node is ancestral to any sample
    # (0/0 gives NaN for nodes that are ancestral to no sample)
    for k in range(D.shape[1]):
        D[:, k] /= A[:, -1]
    return D
... and also, it'd be nice to have the statistic that is simply "how much of the sample genomes did each node contribute", which is:
def compute_ancestry(ts, sample_sets):
    n = np.array([len(x) for x in sample_sets])
    def f(x):
        return x / n
    A = ts.sample_count_stat(sample_sets, f, len(sample_sets),
                             polarised=True, mode='node', strict=False)
    return A
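(A hedged usage sketch of the two definitions above, assuming a small simulation; the variable names are just for illustration:)

import msprime

ts = msprime.simulate(10, random_seed=1)
sets = [list(range(5)), list(range(5, 10))]
print(genomic_descent(ts, sets).shape)   # (num_nodes, 2)
print(compute_ancestry(ts, sets).shape)  # (num_nodes, 2)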
@awohns, would you mind taking a look at this please? It would be great if we could replace the code for our bespoke stats with this.
So here's my understanding of mean_descendants, genomic_descent, and compute_ancestry: as you mentioned, @petrelharp, compute_ancestry does not normalise by the span over which each node is ancestral to any node in a sample set, and thus differs from the other two. The difference between genomic_descent and mean_descendants is that the former also normalises by the size of each sample_set (there's a quick numerical check of this in the sketch below).
I can't think of a situation where you wouldn't want to normalise by the size of sample sets, so mean_descendants doesn't seem so useful to me.
Both genomic_descent and compute_ancestry seem like they have their uses. I guess genomic_descent is the tool to use if you're primarily interested in a focal, internal node, while compute_ancestry is better if you're more interested in the sample sets' ancestry. As I mentioned above in the ancient DNA example, it would make more sense to use genomic_descent there.
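(A minimal numerical check of these relationships, assuming ts and sample_sets are already in scope and using the genomic_descent definition from earlier in this thread:)

import numpy as np

n = np.array([len(s) for s in sample_sets])
gd = genomic_descent(ts, sample_sets)
md = ts.mean_descendants(sample_sets)
# mean_descendants should be genomic_descent with the division by
# sample-set size undone; skip rows that are NaN because the node
# is ancestral to no sample.
ok = ~np.isnan(gd).any(axis=1)
print(np.allclose(md[ok], gd[ok] * n))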
So, shall we just get rid of mean_descendants?
Let's not forget about #328 though (I had!)
Reading over #328, it seems like this discussion was before the formal definitions of genomic_descent and compute_ancestry. Given that those two exist I don't really see a use case for mean_descendants as a standalone statistic... but maybe somebody can think of a reason for keeping it?
I think that to get mean_descendants from @petrelharp's implementation of genomic_descent, all you need to do is remove the division by n inside f, as in the sketch below.
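(A minimal sketch of that change; the function name is just a hypothetical placeholder, and it should match tskit's mean_descendants up to NaN-vs-zero handling for nodes that are ancestral to no sample:)

import numpy as np

def mean_descendants_via_stats(ts, sample_sets):
    def f(x):
        # raw sample counts: no division by sample-set size
        return np.append(x, 1.0 * (np.sum(x) > 0))
    A = ts.sample_count_stat(sample_sets, f, len(sample_sets) + 1,
                             polarised=True, mode='node', strict=False)
    D = A[:, :-1]
    for k in range(D.shape[1]):
        D[:, k] /= A[:, -1]
    return D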
OK, I'm happy to go with whatever you and @petrelharp think here. I just want to delete some code!
I agree that normalizing by the size of the sample set is always sensible; it's at least easy for the end user to reverse, so we don't need an option for it (and certainly not a different function!).
Note: we don't have a function called compute_ancestry already - when you say "these two exist", do you mean using the definition above in this thread?
All these things are ways of summarizing "what proportion of the genomes of X are inherited from Y" and "over what proportion of the genome is any of X descended from Y". I vote we should just provide a simple and descriptively-named method for computing these two quantities, in a way that makes it easy to compute these various downstream statistics. For instance, if we define:
import numpy as np

def compute_ancestry(ts, sample_sets):
    n = np.array([len(x) for x in sample_sets])
    def f(x):
        return x / n
    A = ts.sample_count_stat(sample_sets, f, len(sample_sets),
                             polarised=True, mode='node', strict=False)
    return A

def any_ancestry(ts, sample_sets):
    def f(x):
        # indicator of whether any sample at all inherits from the node
        return np.array([1.0 * (np.sum(x) > 0)])
    A = ts.sample_count_stat(sample_sets, f, 1,
                             polarised=True, mode='node', strict=False)
    return A

def genomic_descent(ts, sample_sets):
    A = compute_ancestry(ts, sample_sets)
    D = any_ancestry(ts, sample_sets)[:, 0]
    for k in range(A.shape[1]):
        A[:, k] /= D
    return A
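(A hedged usage sketch of the three functions above; the names are the placeholders from this thread, not a released API:)

import msprime

ts = msprime.simulate(10, random_seed=42)
sample_sets = [list(range(5)), list(range(5, 10))]
A = compute_ancestry(ts, sample_sets)  # (num_nodes, 2): proportion of each set's genomes inherited from each node
D = any_ancestry(ts, sample_sets)      # (num_nodes, 1): proportion of the genome where the node is ancestral to any sample
G = genomic_descent(ts, sample_sets)   # A normalised by D (NaN where D is zero)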
So, my proposal would be to remove (or, redefine?) mean_descendants, implement the two functions that I have above named compute_ancestry and any_ancestry, and figure out more descriptive names for them. And, probably, make the normalization by any_ancestry an argument to mean_descendants.
What do you think, @awohns - is this sufficiently general? Ideas for what to call these things?
How does this differ from a time-windowed GNN with a time window of 0 (i.e. go up 0 generations from the focal node, then descend down)? Is it just a matter of the normalisation? Or have I misunderstood?
Sorry I didn't respond to your point @petrelharp, what you proposed is sensible to me.
Thanks for reviving this thread @hyanwong. I think that what these statistics try to answer is "what proportion of the genomes of X are inherited from Y", while the equivalent question posed by GNN and windowed GNN is "among the neighbours of Y, what proportion are from X?" So these stats are defined with respect to the sample sets, while GNN/WGNN is defined WRT the focal node's neighbours.
Here's a simple example.
import msprime
from IPython.display import SVG

ts = msprime.simulate(4, random_seed=10)
SVG(ts.draw_svg())
ts.genealogical_nearest_neighbours([5], [[0], [1], [2], [3]])
gives:
array([[0. , 0.33333333, 0.33333333, 0.33333333]])
This is telling us, of the three nearest neighbours of node 5, 1/3 are from ref set [1] etc.
ts.mean_descendants([[0], [1], [2], [3]])[5]
gives:
array([0., 1., 1., 1.])
This is telling us that 100% of the "total genetic material" of reference sets [1], [2], and [3] (and 0% of [0]) comes from node 5.
Right. But there is a close link between (windowed) GNN and genomic descent. It would be helpful to give an example like this somewhere, when describing it.
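(A hedged illustration of that link, reusing ts from the example above together with the genomic_descent definition from earlier in this thread:)

genomic_descent(ts, [[0], [1], [2], [3]])[5]

should give:

array([0., 1., 1., 1.])

because each sample set here has size one and there is a single tree, so the span and sample-set normalisations do nothing: node 5 is ancestral to all of samples 1, 2 and 3 and to none of sample 0. The GNN row, by contrast, rescales to proportions among node 5's neighbours, which is why it sums to 1.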
As discussed in #271, this is a node statistic. The differences to current node stats are that (a) it normalizes by the amount that the node is actually an ancestor; and (b) it is polarised (and polarised=False will give something that is always equal to 1). I propose that we make the behavior of (a) a parameter, and implement that normalization at the python level.
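(A minimal sketch of point (b), assuming a small simulation: with f(x) = x/n, the unpolarised statistic adds f(x) and f(n - x), and x/n + (n - x)/n = 1 for every node in every tree:)

import numpy as np
import msprime

ts = msprime.simulate(6, random_seed=1)
sample_sets = [[0, 1, 2], [3, 4, 5]]
n = np.array([len(s) for s in sample_sets])
A = ts.sample_count_stat(sample_sets, lambda x: x / n, len(sample_sets),
                         polarised=False, mode='node', strict=False)
print(np.allclose(A, 1.0))  # expected: True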