Open petrelharp opened 3 years ago
@petrelharp sorry for being pedantic, but G^T s
gives you additive genetic (or breeding) values, not phenotypes. To get phenotypes we would have to add to G^T s
at least environmental values, as well as effects, appropriate for a given trait, such as sex, age, …
This nomenclature follows the established Fisher (1918) model and 100+ years of nomenclature evolution with quite strong stabilising selection on these terms: Phenotype value = Genetic (or genotypic) value + Environment value, possibly including interaction between the two. Then genetic value is decomposed into additive, dominance and epistatic values.
I propose ts.genetic_values()
which can in principle cover nodes or samples and additive or additive + dominance cases (dominance only with samples).
G^T s gives you additive genetic (or breeding) values, not phenotypes.
Yes, good point! (I was going for 'quick notes', not being careful.)
I propose ts.genetic_values() which can in principle cover nodes or samples and additive or additive + dominance cases (dominance only with samples).
Great idea!
Separately - I've thought it through and I think the efficient way to multiply a vector by the LD matrix is via G G^T
- i.e., by first getting breeding values per sample and then applying G
to this.
So, two matvecs instead of a matmul and a matvec, right?
So, two matvecs instead of a matmul and a matvec, right?
Right - but also, recall that in #1246 we're basically doing G^T G
all at once, as a single matvec. However, I can't see how to do the equivalent for LD in a single matvec - but, it's easy with two matvecs.
Here's a draft of I think a very efficient way to do (1), G^T s
, which is given a vector s
of length equal to the number of mutations, compute for each sample the sum of the values of s
for every mutation that sample carries.
We will iterate left-to-right over the trees, maintaining at all times a vector x
of length equal to the number of nodes, such that x[u]
will be equal to: the sum of the values of s
for all mutations that have been inherited by node u
since the last time the set of samples inherited by u
has changed, except for any mutations still accounted for by x[p(u)]
, where p(u)
is the parent of u
.
So: the output will be out
, and suppose that the output for sample v
is stored in out[k[v]]
.
x
to 0.p
and child c
, traverse from the root down to p
, and for each node v
on this path, add x[v]
to x[u]
for each child u
of v
. If v
is a sample then also add x[v]
to out[k[v]]
. Then set x[v]
to zero.p
and child c
.i
appears on node u
, add s[i]
to x[u]
.Note that this avoids propagating mutations down the tree until "necessary"; i.e., until a maximal shared haplotype in some sense is attained - so, the complexity is, I think, like O((num edges) * (log N) + (num mutations) )
A less efficient way to do this would be to do:
out = np.zeros(ts.num_samples)
for t in ts.trees(sample_lists=True):
for m in t.mutations():
for u in t.samples(m.node):
out[u] += s[m.id]
... but, maybe this is good enough? I think not; see below:
This way does generalize better to the case where we want to compute phenotypes with dominance. For that case, we need to know, for each individual, how many of their nodes are below a given mutation, if the number is nonzero. This fits without our statistics framework already, but the internal state would be equal to the number of individuals, so this is not feasible. If we knew not only sample_lists
but also, say, individual_lists
, then that would do the trick. It would essentially be a sparse representation of the internal state from the statistics framework. However, it is not clear to me that this is very sparse: with N
samples the total length of all sample_lists
is still O(N^2)
. But maybe there is something sparser we could keep track of, like "which individuals have only one node below this one"?
The first option sounds super interesting @petrelharp. I wonder if there's any connection with keeping track of the last time a node was updated (e.g., here, which we might be able to reuse conceptually?
I really like this! It's using the mutation parent column for dynamic programming :heart:
The generalisation to individuals is trickier - how about we get an implementation of the first algorithm implemented and tested, and then think about how to generalise? The sample_lists
idea has never really proved that valuable to be honest - because it's so much work to maintain, it's often quicker to simply do the traversal. This would almost certainly be true of individuals too.
I wonder if there's any connection with keeping track of the last time a node was updated (e.g., here), which we might be able to reuse conceptually?
Ah, you're right! That "last updated time" should be the same - i.e., should give the last time that x
was set to zero. So, that's probably what we need to compute a branch stat version of this (but, haven't thought that through).
The generalisation to individuals is trickier - how about we get an implementation of the first algorithm implemented and tested, and then think about how to generalise?
Sounds good. That's a different issue to this one, anyhow (which is "linear algebra tools").
Sounds good. That's a different issue to this one, anyhow (which is "linear algebra tools").
Was commenting on this:
If we knew not only sample_lists but also, say, individual_lists, then that would do the trick.I
Is there a reason to focus on samples only? I would be interested in sums for ancestors too.
Is there a reason to focus on samples only? I would be interested in sums for ancestors too.
No, I guess not - but, just to check: the sum would be only over the portion of the genome where we have their genotypes, skipping out the "missing data" portions. So, it'd be the contribution to breeding value from the chunk of ancestral haplotype carried by that ancestor. Is that what you want? (We will probably want it to do Bayesian things...)
Yes, "partial" ancestral genomes/individuals will have only "partial" genetic values. In some cases ancestral genomes/individuals could be known in full, say in a simulation or in a pedigreed population, so we should get the full genetic value for such ancestors in these cases (but maybe these are classed as "samples" anyway).
I think that providing genetic values for ancestors (full or partial) will be useful if we want to study evolution of a particular genome segment. I think the same applies also to other node or individual based statistics.
So: the output will be
out
, and suppose that the output for samplev
is stored inout[k[v]]
.
@petrelharp what is k here?
@petrelharp what is k here?
k
is the vector of indices that lets us map from nodes (v
) to index in the output - so, if we wanted output for nodes [0, 2, 4, 1]
then I guess k
would be a vector [0, 3, 1, 0, 2, 0, 0, ..]
. This might not be the best way to arrange things.
Copying Slack discussion also here so we don’t loose it;)
@jeromekelleher you asked what the other multiplication (pre- or left-multiplication) does. Let G
be an nSites x nSamples
matrix, l
be an nSites x 1
vector, and r
be an nSamples x 1
vector.
Today you showed matrix-vector multiplication of G r = x
where x
is an nSites x 1
vector (so we get some "stats" per site) - this is 4. above at https://github.com/tskit-dev/tskit/issues/1547#issue-933945982 as @petrelharp mentioned this is connected to trait_covariance
where we calculate covariance between 1) the number of mutations samples carry at a site and 2) sample phenotypes - so for a site i
we calculate cov(r, G[i,]) = sum_over_j_samples(r_centred[j], G_centred[i,]) / nSamples
(centred means that we removed the mean - for every row in G
and overall in r
). Of note, in a genome-wide association study, we want to regress sample phenotype onto the number of mutations samples carry, to estimate potential association between the two. The simplest estimator is to do one site at a time - here we calculate regression coefficient b_i = cov(r, G[i,]) / var(G[i,])
. This reg coeff is often called allele substitution effect, which tells us if mutation at site I
(=allele substitution) increases or decreases the phenotype.
The other way is l G = y
where y
is an nSamples x 1
vector (so we get some "stats" per sample). This is 1. above at https://github.com/tskit-dev/tskit/issues/1547#issue-933945982 Then 1 G = y
(1
(not l
!) here is a vector of 1s) gives us how many mutations - a "mutation count" (across all sites) samples carry. Further, say, l
holds allele substitution values of all the mutations at the sites (assuming 0/1 mutations or that each mutation is in its own row). Then l G = y
gives a "weighted mutation count" where each mutation is weighted by its effect on a phenotype - we call this additive genetic value (aka breeding value or polygenic risk score) - collective effect of all mutations, assuming additive effects only. @petrelharp proposed an algo for this above https://github.com/tskit-dev/tskit/issues/1547#issuecomment-882235129
Having the l G = y
functionality would mean that we can start doing quantitative genetics simulations with tree sequences!
@jeromekelleher thinking a bit more about the “parsimony approach” you have taken. As you mentioned the nSamples*1 vector, with which you post multiply the genotype matrix, has to be “compressible”, as in, mutations that give rise to the vector values have to map to local tree (trees?) very well. If I am getting this right, this is a strong assumption and it won’t hold for a complex traits, where the samples vector is continuous (say something like a Gaussian) and we don’t have a good way of mapping the sample vector onto individual mutations effects, hence onto trees, but it could be useful in less complex traits?
It's a very strong assumption @gregorgorjanc and it definitely won't hold in general. It's only going to work for things in which the underlying generative process is based on the trees - but since this is the background assumption we're making for a lot things anyway (we think the trees are useful because they are the generating process), I think it's worth exploring.
It's a very strong assumption @gregorgorjanc and it definitely won't hold in general. It's only going to work for things in which the underlying generative process is based on the trees - but since this is the background assumption we're making for a lot things anyway (we think the trees are useful because they are the generating process), I think it's worth exploring.
Totally!
@brieuclehmann is this "compressed" view useful for PCA type calculations? The ancestry that PCA is revealing, is driven exclusively by the tree generating process!
Let
G
be annSites x nSamples
matrix,l
be annSites x 1
vector, andr
be annSamples x 1
vector.
Just a note that in the genetics literature G
symbol is often used to denote the genetic relatedness matrix (GRM) obtained from a genotype matrix (possibly centred or even standardised see here for more).
Thinking about the usefulness of linear algebra operations with tree sequence ... One of key operations in data analysis is solving the least squares problem. There are many flavours of solving the least squares problem. To estimate mutation effects (= allele substitution effects) by regressing phenotypes onto genotypes, I have been using Gauss-Seidel iterative method (see this nice "Technical Note: Computing Strategies in Genome-Wide Selection" where the Gauss-Seidel Residual Update variant is described) - here we are estimating all mutation effects jointly so that we can use them in genomic prediction, not one mutation at a time as mentioned above for GWAS. In Gauss-Seidel Residual update we need the following operations (I have a simple Fortran90 implementation here):
a. a dot product of a mutation row of G
with itself dot(G[mutation,],G[mutation,])
(see here)
b. a dot product of a mutation row of G
with a samples vector r
(dot(G[mutation,],r)
(see here)
c. a product of a mutation row of G
with a scalar k
(G[mutation,]*k)
(see here)
d. a product of sites vector l
with a matrix G
(l G = y
) (see here)
I have been thinking about various quantitative genetics models and tree sequence. In this thread the mentioned linear algebra tools involve the extended genotype matrix G
, so all calculations involve sites and their mutations. Would it be possible to do the same “style” of operations but involving branches instead of sites&mutations? Say, in https://github.com/tskit-dev/tskit/issues/1547#issuecomment-1062744283 one operation was calculating “weighted mutation count”, which is y
inl G = y
, where l
are mutation weights (=effects). I am thinking of calculating the “sum of branch weights”, that is, start at the top of the tree sequence and accumulate weights for each branch as we go down the trees (maybe samples-to-root direction works to?). This is related to general_stat
but there only sample weights are accumulated (from samples towards roots).
Possibly - you'd need to explain the details to me in person though.
I think that yes, definitely! The "expected value of statistic S under infinite sites" is a good definition, and asked to anything we're considering.
On this topic, see also #2882.
Let
G
be the "extended genotype matrix", whose rows are indexed by mutations (not sites) and samples, withG[i,j] = 1
if the sample has inherited that mutation and0
otherwise. This is something people want to work with, but for many applications we don't need to actually return the matrix, but instead just multiply it by things. For instance:s
is a vector of effect sizes, thenG^T s
gives a vector of additive phenotypes computed by adding up the entries ofs
corresponding to all the mutations each sample has inheritedG^T G
is related to the (site) genetic relatedness matrix; so #1246 is something in this direction.G G^T
is similarly related to the LD matrix, and multiplying the LD matrix by things is important in eg estimating heritabiliyG x
is similar to what we've calledtrait_covariance
withmode="site", windows="sites"
- except that gives something per site, not per mutation.Here "related to" means "up to some normalization that I'm not figuring out right now".
There's API and computational issues to work out here. For instance, we could just implement methods that do
G x
andG^T y
, and use these to getG^T G x
; however, that's not how we currently do things in #1246.Preliminary brainstorming: besides #1246,
G^T s
: maybets.additive_phenotypes(samples, effect_sizes)
would returnG^T effect_sizes
. Also, we probably wantts.individual_phenotypes(individuals, effect_sizes, dominance_coefficients)
that does the analogous thing for diploids. (But, maybe it'd be nice to have a single function that you can pass either sample IDs or individual IDs somehow.)I need to better understand what computations are done in LD space to see if this is the right thing, but possibly we should implement
G G^T x
asts.ld_matrix_multiplication(x)
.And, maybe
G x
would be an option totrait_covariance
? Or,mutation_covariance(x)
?