Aaron's comments - Githubissues

petrelharp commented 4 years ago

Great comments from @apragsdale:

Overall, reads pretty clearly, though it's pretty thick with definitions and
examples. If target audience is heavy pop-gen readers, probably not a problem,
but I can imagine it being fairly inaccessible to broader readers.

In the intro in Framework and statistics, you dive straight into ways of
generalizing statistics to compute them over a tree, talking about propagating
weights up trees, using ``nodes'' and ``edges'' and other tree terminology, but
you haven't yet introduced the tree sequences. Might make sense to reorder this
section and put \section{Tree sequences} before this so that we have the
terminology and data structure in mind before talking about computing things on
trees?

When computing statistics, sometimes you have the factor $\frac{1}{L}$ and
sometimes not - is there a reason why sometimes you normalize by sequence
length, but sometimes not? Similarly, in the application to $f_4$ statistics
from simulated data, you report a significant negative value of $-700$ (not
scaled by sequence length).

Factor of 2 in Example 2, since we don't care about the order of drawing
alleles when measuring heterozygosity. Similarly, in Example 3, should it be
$f(x_1,x_2) = \frac{x_1(n_2-x_2)}{n_1n_2} + \frac{(n_1-x_1)x_2}{n_1n_2}$ to
account for the two ways of finding differences?
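
For reference, a minimal sketch of the two summary functions being discussed here. It assumes that $x_i$ counts the samples in set $i$ carrying a given allele and that $n_i$ is the size of set $i$; the manuscript's Examples 2 and 3 may define things differently. The function names are hypothetical, and the brute-force check is only there to confirm that the symmetric form counts both ways of finding a difference.

```python
from fractions import Fraction
from itertools import product


def two_sample_divergence(x1, n1, x2, n2):
    """Symmetric form suggested above: probability that one allele drawn
    from each of two sample sets differ (assumed meaning of x_i, n_i)."""
    return Fraction(x1 * (n2 - x2) + (n1 - x1) * x2, n1 * n2)


def brute_force_divergence(x1, n1, x2, n2):
    """Same quantity by enumerating every cross-set pair of alleles."""
    set1 = [1] * x1 + [0] * (n1 - x1)
    set2 = [1] * x2 + [0] * (n2 - x2)
    differing = sum(1 for a, b in product(set1, set2) if a != b)
    return Fraction(differing, n1 * n2)


def heterozygosity(x, n):
    """Probability that two distinct alleles drawn from one sample set
    differ. The factor of 2 counts both orders of drawing the pair."""
    return Fraction(2 * x * (n - x), n * (n - 1))
```

Under these assumptions, `two_sample_divergence(x1, n1, x2, n2)` agrees with the enumeration for every choice of counts, whereas either of its two terms alone covers only one of the two ways a pair can differ.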

Figure 5 is a bit confusing - why are the scales so different between site and
branch stats? If we want to compare the site to branch statistics from each
inferred tree sequence, it would be helpful to see them on top of each other
(maybe just using one or two sampling locations, and putting the rest in the
supp?). Probably would need to normalize the branch stats by sequence length? I
always find it hard to eyeball between figure panels. Also, why are the scales
between each of the inferred tree sequences' branch statistics so different?
Relate and GEVA have some shared general trends, but there's a maybe 2-4-fold
difference (or more?) between their magnitudes.

petrelharp commented 4 years ago

> Might make sense to reorder this section and put \section{Tree sequences} before this so that we have the terminology and data structure in mind before talking about computing things on trees?

Well, I wanted to keep it the way it is, but Reviewer 2 says just the same thing. I'll see what I can do.

petrelharp commented 4 years ago

> When computing statistics, sometimes you have the factor $\frac{1}{L}$ and sometimes not - is there a reason why sometimes you normalize by sequence length, but sometimes not?

Well, the answer to this is that we don't normalize when looking at a single thing (a single site or a single tree) but we do when looking at a stretch of the genome. I'm not sure where to explain this, though. We do introduce the Site and Branch stats as averages over a region of the genome?
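
To make the distinction concrete, here is a minimal sketch of how the $\frac{1}{L}$ factor could enter. It assumes that a statistic over a region is the sum of per-site values (Site) or the span-weighted sum of per-tree values (Branch), divided by the length $L$ of the region. The function names and the `(span, value)` representation of trees are hypothetical, not taken from the manuscript.

```python
def site_stat_over_region(per_site_values, sequence_length):
    """Sum the unnormalized per-site values, then divide by the region
    length L so the result is a per-unit-length average."""
    return sum(per_site_values) / sequence_length


def branch_stat_over_region(trees, sequence_length):
    """Each tree contributes its unnormalized value weighted by the span
    of genome it covers; dividing by L gives the average over the region.

    `trees` is a list of (span, per_tree_value) pairs (assumed layout).
    """
    total = sum(span * value for span, value in trees)
    return total / sequence_length
```

A value for a single site or a single tree is the unnormalized quantity passed in; the division by `sequence_length` only appears once values are combined across a stretch of genome.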

petrelharp commented 4 years ago

> Factor of 2 in Example 2

We've got an explanatory note after Example 2 now.

petrelharp commented 4 years ago

> Figure 5 is a bit confusing - why are the scales so different between site and branch stats?

Well, the caption explains that "The ratio of the Site statistic to the Branch statistic [...] hovers around typical per-generation estimates of the human mutation rate".

I think maybe this was talking about a previous version of the figure.

petrelharp commented 4 years ago

Ok - I think we've actually dealt with all these already, except as noted above. Any objections to closing this?

jeromekelleher commented 4 years ago

Nope, closing.

petrelharp / treestats_ms

Aaron's comments #46