petrelharp / treestats_ms

1 stars 1 forks source link

Comments from Wilder #43

Closed jeromekelleher closed 4 years ago

jeromekelleher commented 4 years ago

comments from @awohns

petrelharp commented 4 years ago

Page 13 caption to figure 5: “tree sequences inferred by Relate”, not quite tree sequences though, right?

What's this mean?

petrelharp commented 4 years ago

Hi, @awohns - here's the Genomic descent statistic definition: Screenshot from 2020-03-31 17-01-11

... what exactly does "copies from" mean here?

petrelharp commented 4 years ago

If "copies from" means "inherits from", then our example 1 (ancestry proportions) is similar to the "genomic descent" but not quite the same, since "genomic descent" divides by something - I think that you'd compute the genomic descent statistic like so (although not quite sure about the normalization, which seems strange?):

def genomic_descent(ts, sample_sets):
   n = np.array([len(x) for x in sample_sets])
   def f(x):
      out = x/n
      out.append(1.0 * (sum(x) > 0))
      return out
   A = ts.sample_count_stat(sample_sets, f, len(sample_sets) + 1, polarised=True, mode='node', strict=False)
   D = A[:, :-1]
   for k in range(D.shape[1]):
      D[:, k] /= A[:, -1]
   return D
petrelharp commented 4 years ago

Also - thanks very much for the comments!!

jeromekelleher commented 4 years ago

Page 13 caption to figure 5: “tree sequences inferred by Relate”, not quite tree sequences though, right? What's this mean?

I think Wilder's talking about the fact that Relate returns a JBOT (just a bunch of trees; I'm proud of that one, hideous hybrid of storage and popgen jargon). The distinction is being lost on people --- see the section on tsinfer here where they (mistakenly) say that tsinfer returns a JBOT --- so perhaps we should clarify a bit here.

awohns commented 4 years ago

Yes that's what I meant re: relate and tree sequences!

Regarding the genomic descent: copies from means inherits from (I was using the copying terminology because it was in the context of tsinfer). There are two normalisations in the genomic descent stat: 1st term in denominator: the total span of the node appearing at all in the tree sequence: the node might not exist for the whole tree sequence, or it might only be ancestral to samples in the reference sets for part of the sequence. 2nd term in denominator: the size of the reference set. Think of the extreme case where we have one reference set with 90% of our total samples and another with 10%, it will always look like node u is most closely related to the bigger reference sets.

The genomic_descent() function you sent seems equivalent to the definition based on a few examples I tested it on!

petrelharp commented 4 years ago

Ok, thanks @awohns! This could be a nice example to write up for a tskit vignette. (Note: I simplified the definition above somewhat.)

petrelharp commented 4 years ago

I've added a note after that example saying that it could be used to compute genomic descent.

petrelharp commented 4 years ago

I'm not sure what I should say about Relate returning JBOT: how about

Although Relate does not directly return a tree sequence, its output can be converted to a tree sequence.

... ?

jeromekelleher commented 4 years ago

I think we need to be more explicit with Relate:

Relate output can be converted into tree sequence format, albeit inefficiently as each tree must be stored independently. Although much of the efficiency of tree sequences are lost when no edges are shared between adjacent trees, branch and sites statistics are still well defined and can be computed (node statistics are less meaningful in this context).

As I say, people aren't getting the fact that Relate output is a JBOT (or worse, that tree sequences in general aren't JBOTs) so we need to make the point.

petrelharp commented 4 years ago

albeit inefficiently as each tree must be stored independently

So, I'm confused on this point. Looking at the Relate tree sequences, there are nodes shared across multiple trees, I'm pretty sure - do you mean something different? Or am I wrong on this?

jeromekelleher commented 4 years ago

Leo might have put in some optimisations to reduce the total number of nodes with some sort of reusing strategy - are there any edges that span more than one tree?

petrelharp commented 4 years ago

are there any edges that span more than one tree?

Yes, there's quite a bit of this.

jeromekelleher commented 4 years ago

Huh. I'll send him an email, see what he says.