tskit-dev / tskit


implement mean_descendants as a node statistic #273

Open | petrelharp opened this issue 5 years ago

petrelharp commented 5 years ago

As discussed in #271, this is a node statistic. The differences from the current node stats are that (a) it normalizes by the amount of the genome over which the node is actually an ancestor; and (b) it is polarised (and polarised=False will give something that is always equal to 1). I propose that we make the behaviour in (a) a parameter, and implement that normalization at the python level.

jeromekelleher commented 5 years ago

It's debatable whether we want (a) at all - @hyanwong and @awohns, any thoughts here?

hyanwong commented 5 years ago

@awohns will have the strongest opinion here, but I think that it should be possible to do (a) easily. I think there are reasonable use-cases, and I think Gil agrees that it may be a useful stat.

awohns commented 5 years ago

Yes, I agree with @hyanwong that there are cases where you want to see whether the node is actually an ancestor of the labeled nodes, as opposed to being more distantly related to them (as in the standard GNN statistic). In the ancient-DNA tree-sequence-inference world (for instance), this would distinguish between ancient samples that are inferred to be direct ancestors of modern samples and ancient samples that are relatives of the ancestors of modern samples.

jeromekelleher commented 5 years ago

Great, sounds like we all agree with @petrelharp then. Adding this to the 0.2.0 release jobs.

jeromekelleher commented 5 years ago

As discussed in #328, moving this out to 0.2.1 (with some minor changes for forward compat)

petrelharp commented 4 years ago

Here's a possible implementation (from https://github.com/petrelharp/treestats_ms/issues/43 - the same statistic?):

import numpy as np

def genomic_descent(ts, sample_sets):
    # Sizes of the sample sets, for normalising counts to proportions.
    n = np.array([len(x) for x in sample_sets])

    def f(x):
        # Proportion of each sample set inheriting from the node, plus an
        # indicator of whether the node is ancestral to any sample at all.
        return np.append(x / n, 1.0 * (np.sum(x) > 0))

    A = ts.sample_count_stat(
        sample_sets, f, len(sample_sets) + 1,
        polarised=True, mode='node', strict=False)
    D = A[:, :-1]
    # Normalise by the span over which each node is ancestral to any sample;
    # nodes that are never ancestral give 0/0 = nan here.
    with np.errstate(invalid='ignore'):
        for k in range(D.shape[1]):
            D[:, k] /= A[:, -1]
    return D
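
For example, a quick hypothetical check on a small simulation (the sample sets here are arbitrary, and this assumes msprime is installed):

import msprime

ts = msprime.simulate(10, random_seed=42)
D = genomic_descent(ts, [[0, 1, 2], [3, 4, 5]])
# One row per node, one column per sample set: the proportion of each
# sample set inheriting from that node, over the span where the node is
# ancestral to any sample.
print(D.shape)  # (ts.num_nodes, 2)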

... and also, it'd be nice to have the statistic that is simply "how much of the sample genomes did each node contribute", which is:

def compute_ancestry(ts, sample_sets):
    # Sizes of the sample sets, for normalising counts to proportions.
    n = np.array([len(x) for x in sample_sets])

    def f(x):
        # Proportion of each sample set inheriting from the node.
        return x / n

    A = ts.sample_count_stat(
        sample_sets, f, len(sample_sets),
        polarised=True, mode='node', strict=False)
    return A
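
A quick interpretation check (hypothetical usage): the root of a single-tree sequence is ancestral to every sample over the whole genome, so its row of the output should be all ones.

import msprime

ts = msprime.simulate(4, random_seed=10)
A = compute_ancestry(ts, [[0, 1], [2, 3]])
# The single root is ancestral to all samples everywhere, so we expect 1.0
# for every sample set.
print(A[ts.first().root])  # expect [1. 1.]
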
jeromekelleher commented 4 years ago

@awohns, would you mind taking a look at this please? It would be great if we could replace the code for our bespoke stats with this.

awohns commented 4 years ago

So here's my understanding of mean_descendants, genomic_descent, and compute_ancestry: as you mentioned, @petrelharp, compute_ancestry does not normalise by the span over which each node is ancestral to any node in a sample set, and thus differs from the other two. The difference between genomic_descent and mean_descendants is that the former also normalises by the size of each sample set.

I can't think of a situation where you wouldn't want to normalise by the size of sample sets, so mean_descendants doesn't seem so useful to me.

Both genomic_descent and compute_ancestry seem like they have their uses. I guess genomic_descent is the tool to use if you're primarily interested in a focal internal node, while compute_ancestry is better if you're more interested in the sample sets' ancestry. As I mentioned above in the ancient DNA example, it would make more sense to use genomic_descent there.

jeromekelleher commented 4 years ago

So, shall we just get rid of mean_descendants?

jeromekelleher commented 4 years ago

Let's not forget about #328 though (I had!)

awohns commented 4 years ago

Reading over #328, it seems like that discussion took place before the formal definitions of genomic_descent and compute_ancestry. Given that those two exist, I don't really see a use case for mean_descendants as a standalone statistic... but maybe somebody can think of a reason for keeping it?

I think that to get mean_descendants from @petrelharp's implementation of genomic_descent, all you need to do is remove the division by n in the summary function f.
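
Concretely, that would look something like this (a sketch based on the genomic_descent implementation above with the division by n dropped; the name mean_descendants_stat is just for illustration):

import numpy as np

def mean_descendants_stat(ts, sample_sets):
    def f(x):
        # Raw sample counts (no division by sample-set size), plus an
        # indicator of whether the node is ancestral to any sample.
        return np.append(x, 1.0 * (np.sum(x) > 0))

    A = ts.sample_count_stat(
        sample_sets, f, len(sample_sets) + 1,
        polarised=True, mode='node', strict=False)
    # Normalise by the span over which each node is ancestral to any sample.
    with np.errstate(invalid='ignore'):
        D = A[:, :-1] / A[:, [-1]]
    return D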

jeromekelleher commented 4 years ago

OK, I'm happy to go with whatever you and @petrelharp think here. I just want to delete some code!

petrelharp commented 4 years ago

I agree that normalizing by the size of the sample set is always sensible; it's at least easy for the end user to reverse, so we don't need an option for it (and certainly not a different function!).

Note: we don't already have a function called compute_ancestry - when you say "these two exist", do you mean the definitions above in this thread?

All these things are ways of summarizing "what proportion of the genomes of X are inherited from Y" and "over what proportion of the genome is any of X descended from Y". I vote we should just provide simple and descriptively-named methods for computing these two quantities, in a way that makes it easy to compute the various downstream statistics. For instance, if we define:

import numpy as np

def compute_ancestry(ts, sample_sets):
    # Sizes of the sample sets, for normalising counts to proportions.
    n = np.array([len(x) for x in sample_sets])

    def f(x):
        # Proportion of each sample set inheriting from the node.
        return x / n

    A = ts.sample_count_stat(
        sample_sets, f, len(sample_sets),
        polarised=True, mode='node', strict=False)
    return A

def any_ancestry(ts, sample_sets):
    def f(x):
        # Indicator of whether the node is ancestral to any sample.
        return np.array([1.0 * (np.sum(x) > 0)])

    A = ts.sample_count_stat(
        sample_sets, f, 1,
        polarised=True, mode='node', strict=False)
    return A

def genomic_descent(ts, sample_sets):
    A = compute_ancestry(ts, sample_sets)
    D = any_ancestry(ts, sample_sets)[:, 0]
    # Nodes that are never ancestral to any sample give 0/0 = nan here.
    with np.errstate(invalid='ignore'):
        for k in range(A.shape[1]):
            A[:, k] /= D
    return A
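
As a quick sanity check (hypothetical usage, with the functions above defined at module level rather than as TreeSequence methods, and assuming msprime is installed):

import msprime

ts = msprime.simulate(4, random_seed=10)
A = compute_ancestry(ts, [[0], [1], [2], [3]])
G = genomic_descent(ts, [[0], [1], [2], [3]])
root = ts.first().root
# The root is ancestral to every sample over the whole genome, so both
# statistics give 1.0 for it in every column.
print(A[root], G[root])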

So, my proposal would be to remove (or, redefine?) mean_descendants, implement the two functions that I have above named compute_ancestry and any_ancestry, and figure out more descriptive names for them. And, probably, make the normalization by any_ancestry an argument to mean_descendants.

What do you think, @awohns - is this sufficiently general? Ideas for what to call these things?

hyanwong commented 3 years ago

How does this differ from a time-windowed GNN with a time window of 0 (i.e. go up 0 generations from the focal node, then descend)? Is it just a matter of the normalisation? Or have I misunderstood?

awohns commented 3 years ago

Sorry I didn't respond to your point, @petrelharp; what you proposed is sensible to me.

Thanks for reviving this thread, @hyanwong. I think the question these statistics try to answer is "what proportion of the genomes of X are inherited from Y", while the equivalent question posed by GNN and windowed GNN is "among the neighbours of Y, what proportion are from X?". So these stats are defined with respect to the sample sets, while GNN/windowed GNN are defined with respect to the focal node's neighbours.

Here's a simple example.

import msprime
from IPython.display import SVG

ts = msprime.simulate(4, random_seed=10)
SVG(ts.draw_svg())

[Tree drawing of the simulated tree sequence]

ts.genealogical_nearest_neighbours([5], [[0], [1], [2], [3]]) gives:

array([[0.        , 0.33333333, 0.33333333, 0.33333333]])

This is telling us that, of the three nearest neighbours of node 5, 1/3 are from reference set [1], etc.

ts.mean_descendants([[0], [1], [2], [3]])[5] gives:

array([0., 1., 1., 1.])

This is telling us that 100% of the "total genetic material" of reference sets [1], [2], and [3] (and 0% of [0]) comes from node 5.

hyanwong commented 3 years ago

Right. But there is a close link between (windowed) GNN and genomic descent. It would be helpful to give an example like this somewhere when describing it.