Closed jeromekelleher closed 4 years ago
Page 13 caption to figure 5: “tree sequences inferred by Relate”, not quite tree sequences though, right?
What's this mean?
Hi, @awohns - here's the Genomic descent statistic definition:
... what exactly does "copies from" mean here?
If "copies from" means "inherits from", then our example 1 (ancestry proportions) is similar to the "genomic descent" but not quite the same, since "genomic descent" divides by something - I think that you'd compute the genomic descent statistic like so (although not quite sure about the normalization, which seems strange?):
def genomic_descent(ts, sample_sets):
n = np.array([len(x) for x in sample_sets])
def f(x):
out = x/n
out.append(1.0 * (sum(x) > 0))
return out
A = ts.sample_count_stat(sample_sets, f, len(sample_sets) + 1, polarised=True, mode='node', strict=False)
D = A[:, :-1]
for k in range(D.shape[1]):
D[:, k] /= A[:, -1]
return D
Also - thanks very much for the comments!!
Page 13 caption to figure 5: “tree sequences inferred by Relate”, not quite tree sequences though, right? What's this mean?
I think Wilder's talking about the fact that Relate returns a JBOT (just a bunch of trees; I'm proud of that one, hideous hybrid of storage and popgen jargon). The distinction is being lost on people --- see the section on tsinfer here where they (mistakenly) say that tsinfer returns a JBOT --- so perhaps we should clarify a bit here.
Yes that's what I meant re: relate and tree sequences!
Regarding the genomic descent: copies from means inherits from (I was using the copying terminology because it was in the context of tsinfer). There are two normalisations in the genomic descent stat: 1st term in denominator: the total span of the node appearing at all in the tree sequence: the node might not exist for the whole tree sequence, or it might only be ancestral to samples in the reference sets for part of the sequence. 2nd term in denominator: the size of the reference set. Think of the extreme case where we have one reference set with 90% of our total samples and another with 10%, it will always look like node u is most closely related to the bigger reference sets.
The genomic_descent() function you sent seems equivalent to the definition based on a few examples I tested it on!
Ok, thanks @awohns! This could be a nice example to write up for a tskit vignette. (Note: I simplified the definition above somewhat.)
I've added a note after that example saying that it could be used to compute genomic descent.
I'm not sure what I should say about Relate returning JBOT: how about
Although Relate does not directly return a tree sequence, its output can be converted to a tree sequence.
... ?
I think we need to be more explicit with Relate:
Relate output can be converted into tree sequence format, albeit inefficiently as each tree must be stored independently. Although much of the efficiency of tree sequences are lost when no edges are shared between adjacent trees, branch and sites statistics are still well defined and can be computed (node statistics are less meaningful in this context).
As I say, people aren't getting the fact that Relate output is a JBOT (or worse, that tree sequences in general aren't JBOTs) so we need to make the point.
albeit inefficiently as each tree must be stored independently
So, I'm confused on this point. Looking at the Relate tree sequences, there are nodes shared across multiple trees, I'm pretty sure - do you mean something different? Or am I wrong on this?
Leo might have put in some optimisations to reduce the total number of nodes with some sort of reusing strategy - are there any edges that span more than one tree?
are there any edges that span more than one tree?
Yes, there's quite a bit of this.
Huh. I'll send him an email, see what he says.
comments from @awohns
Page 5, Example 1 Ancestry proportions: this seems quite similar, if not equivalent to the Genomic Descent statistic I defined in https://www.tandfonline.com/doi/full/10.1080/03014460.2019.1623912.
Page 13 caption to figure 5: “tree sequences inferred by Relate”, not quite tree sequences though, right?