tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

add more linear algebra tools #1547

Open petrelharp opened 3 years ago

petrelharp commented 3 years ago

Let G be the "extended genotype matrix", whose rows are indexed by mutations (not sites) and samples, with G[i,j] = 1 if the sample has inherited that mutation and 0 otherwise. This is something people want to work with, but for many applications we don't need to actually return the matrix, but instead just multiply it by things. For instance:

  1. If s is a vector of effect sizes, then G^T s gives a vector of additive phenotypes computed by adding up the entries of s corresponding to all the mutations each sample has inherited
  2. G^T G is related to the (site) genetic relatedness matrix; so #1246 is something in this direction.
  3. G G^T is similarly related to the LD matrix, and multiplying the LD matrix by things is important in eg estimating heritabiliy
  4. G x is similar to what we've called trait_covariance with mode="site", windows="sites" - except that gives something per site, not per mutation.

Here "related to" means "up to some normalization that I'm not figuring out right now".

There's API and computational issues to work out here. For instance, we could just implement methods that do G x and G^T y, and use these to get G^T G x; however, that's not how we currently do things in #1246.

Preliminary brainstorming: besides #1246,

gregorgorjanc commented 3 years ago

@petrelharp sorry for being pedantic, but G^T s gives you additive genetic (or breeding) values, not phenotypes. To get phenotypes we would have to add to G^T s at least environmental values, as well as effects, appropriate for a given trait, such as sex, age, …

This nomenclature follows the established Fisher (1918) model and 100+ years of nomenclature evolution with quite strong stabilising selection on these terms: Phenotype value = Genetic (or genotypic) value + Environment value, possibly including interaction between the two. Then genetic value is decomposed into additive, dominance and epistatic values.

I propose ts.genetic_values() which can in principle cover nodes or samples and additive or additive + dominance cases (dominance only with samples).

petrelharp commented 3 years ago

G^T s gives you additive genetic (or breeding) values, not phenotypes.

Yes, good point! (I was going for 'quick notes', not being careful.)

I propose ts.genetic_values() which can in principle cover nodes or samples and additive or additive + dominance cases (dominance only with samples).

Great idea!

petrelharp commented 3 years ago

Separately - I've thought it through and I think the efficient way to multiply a vector by the LD matrix is via G G^T - i.e., by first getting breeding values per sample and then applying G to this.

gregorgorjanc commented 3 years ago

So, two matvecs instead of a matmul and a matvec, right?

petrelharp commented 3 years ago

So, two matvecs instead of a matmul and a matvec, right?

Right - but also, recall that in #1246 we're basically doing G^T G all at once, as a single matvec. However, I can't see how to do the equivalent for LD in a single matvec - but, it's easy with two matvecs.

petrelharp commented 3 years ago

Here's a draft of I think a very efficient way to do (1), G^T s, which is given a vector s of length equal to the number of mutations, compute for each sample the sum of the values of s for every mutation that sample carries.

We will iterate left-to-right over the trees, maintaining at all times a vector x of length equal to the number of nodes, such that x[u] will be equal to: the sum of the values of s for all mutations that have been inherited by node u since the last time the set of samples inherited by u has changed, except for any mutations still accounted for by x[p(u)], where p(u) is the parent of u.

So: the output will be out, and suppose that the output for sample v is stored in out[k[v]].

  1. Initialize x to 0.
  2. After adding an edge with parent p and child c, traverse from the root down to p, and for each node v on this path, add x[v] to x[u] for each child u of v. If v is a sample then also add x[v] to out[k[v]]. Then set x[v] to zero.
  3. Do the same before removing an edge with parent p and child c.
  4. If a mutation i appears on node u, add s[i] to x[u].

Note that this avoids propagating mutations down the tree until "necessary"; i.e., until a maximal shared haplotype in some sense is attained - so, the complexity is, I think, like O((num edges) * (log N) + (num mutations) )

petrelharp commented 3 years ago

A less efficient way to do this would be to do:

out = np.zeros(ts.num_samples)
for t in ts.trees(sample_lists=True):
   for m in t.mutations():
     for u in t.samples(m.node):
       out[u] += s[m.id]

... but, maybe this is good enough? I think not; see below:

This way does generalize better to the case where we want to compute phenotypes with dominance. For that case, we need to know, for each individual, how many of their nodes are below a given mutation, if the number is nonzero. This fits without our statistics framework already, but the internal state would be equal to the number of individuals, so this is not feasible. If we knew not only sample_lists but also, say, individual_lists, then that would do the trick. It would essentially be a sparse representation of the internal state from the statistics framework. However, it is not clear to me that this is very sparse: with N samples the total length of all sample_lists is still O(N^2). But maybe there is something sparser we could keep track of, like "which individuals have only one node below this one"?

jeromekelleher commented 3 years ago

The first option sounds super interesting @petrelharp. I wonder if there's any connection with keeping track of the last time a node was updated (e.g., here, which we might be able to reuse conceptually?

I really like this! It's using the mutation parent column for dynamic programming :heart:

jeromekelleher commented 3 years ago

The generalisation to individuals is trickier - how about we get an implementation of the first algorithm implemented and tested, and then think about how to generalise? The sample_lists idea has never really proved that valuable to be honest - because it's so much work to maintain, it's often quicker to simply do the traversal. This would almost certainly be true of individuals too.

petrelharp commented 3 years ago

I wonder if there's any connection with keeping track of the last time a node was updated (e.g., here), which we might be able to reuse conceptually?

Ah, you're right! That "last updated time" should be the same - i.e., should give the last time that x was set to zero. So, that's probably what we need to compute a branch stat version of this (but, haven't thought that through).

petrelharp commented 3 years ago

The generalisation to individuals is trickier - how about we get an implementation of the first algorithm implemented and tested, and then think about how to generalise?

Sounds good. That's a different issue to this one, anyhow (which is "linear algebra tools").

jeromekelleher commented 3 years ago

Sounds good. That's a different issue to this one, anyhow (which is "linear algebra tools").

Was commenting on this:

If we knew not only sample_lists but also, say, individual_lists, then that would do the trick.I

gregorgorjanc commented 3 years ago

Is there a reason to focus on samples only? I would be interested in sums for ancestors too.

petrelharp commented 3 years ago

Is there a reason to focus on samples only? I would be interested in sums for ancestors too.

No, I guess not - but, just to check: the sum would be only over the portion of the genome where we have their genotypes, skipping out the "missing data" portions. So, it'd be the contribution to breeding value from the chunk of ancestral haplotype carried by that ancestor. Is that what you want? (We will probably want it to do Bayesian things...)

gregorgorjanc commented 3 years ago

Yes, "partial" ancestral genomes/individuals will have only "partial" genetic values. In some cases ancestral genomes/individuals could be known in full, say in a simulation or in a pedigreed population, so we should get the full genetic value for such ancestors in these cases (but maybe these are classed as "samples" anyway).

I think that providing genetic values for ancestors (full or partial) will be useful if we want to study evolution of a particular genome segment. I think the same applies also to other node or individual based statistics.

gregorgorjanc commented 3 years ago

So: the output will be out, and suppose that the output for sample v is stored in out[k[v]].

@petrelharp what is k here?

petrelharp commented 3 years ago

@petrelharp what is k here?

k is the vector of indices that lets us map from nodes (v) to index in the output - so, if we wanted output for nodes [0, 2, 4, 1] then I guess k would be a vector [0, 3, 1, 0, 2, 0, 0, ..]. This might not be the best way to arrange things.

gregorgorjanc commented 2 years ago

Copying Slack discussion also here so we don’t loose it;)

@jeromekelleher you asked what the other multiplication (pre- or left-multiplication) does. Let G be an nSites x nSamples matrix, l be an nSites x 1 vector, and r be an nSamples x 1 vector.

Today you showed matrix-vector multiplication of G r = x where x is an nSites x 1 vector (so we get some "stats" per site) - this is 4. above at https://github.com/tskit-dev/tskit/issues/1547#issue-933945982 as @petrelharp mentioned this is connected to trait_covariance where we calculate covariance between 1) the number of mutations samples carry at a site and 2) sample phenotypes - so for a site i we calculate cov(r, G[i,]) = sum_over_j_samples(r_centred[j], G_centred[i,]) / nSamples (centred means that we removed the mean - for every row in G and overall in r). Of note, in a genome-wide association study, we want to regress sample phenotype onto the number of mutations samples carry, to estimate potential association between the two. The simplest estimator is to do one site at a time - here we calculate regression coefficient b_i = cov(r, G[i,]) / var(G[i,]). This reg coeff is often called allele substitution effect, which tells us if mutation at site I (=allele substitution) increases or decreases the phenotype.

The other way is l G = y where y is an nSamples x 1 vector (so we get some "stats" per sample). This is 1. above at https://github.com/tskit-dev/tskit/issues/1547#issue-933945982 Then 1 G = y (1 (not l!) here is a vector of 1s) gives us how many mutations - a "mutation count" (across all sites) samples carry. Further, say, l holds allele substitution values of all the mutations at the sites (assuming 0/1 mutations or that each mutation is in its own row). Then l G = y gives a "weighted mutation count" where each mutation is weighted by its effect on a phenotype - we call this additive genetic value (aka breeding value or polygenic risk score) - collective effect of all mutations, assuming additive effects only. @petrelharp proposed an algo for this above https://github.com/tskit-dev/tskit/issues/1547#issuecomment-882235129

Having the l G = y functionality would mean that we can start doing quantitative genetics simulations with tree sequences!

gregorgorjanc commented 2 years ago

@jeromekelleher thinking a bit more about the “parsimony approach” you have taken. As you mentioned the nSamples*1 vector, with which you post multiply the genotype matrix, has to be “compressible”, as in, mutations that give rise to the vector values have to map to local tree (trees?) very well. If I am getting this right, this is a strong assumption and it won’t hold for a complex traits, where the samples vector is continuous (say something like a Gaussian) and we don’t have a good way of mapping the sample vector onto individual mutations effects, hence onto trees, but it could be useful in less complex traits?

jeromekelleher commented 2 years ago

It's a very strong assumption @gregorgorjanc and it definitely won't hold in general. It's only going to work for things in which the underlying generative process is based on the trees - but since this is the background assumption we're making for a lot things anyway (we think the trees are useful because they are the generating process), I think it's worth exploring.

gregorgorjanc commented 2 years ago

It's a very strong assumption @gregorgorjanc and it definitely won't hold in general. It's only going to work for things in which the underlying generative process is based on the trees - but since this is the background assumption we're making for a lot things anyway (we think the trees are useful because they are the generating process), I think it's worth exploring.

Totally!

@brieuclehmann is this "compressed" view useful for PCA type calculations? The ancestry that PCA is revealing, is driven exclusively by the tree generating process!

gregorgorjanc commented 2 years ago

Let G be an nSites x nSamples matrix, l be an nSites x 1 vector, and r be an nSamples x 1 vector.

Just a note that in the genetics literature G symbol is often used to denote the genetic relatedness matrix (GRM) obtained from a genotype matrix (possibly centred or even standardised see here for more).

gregorgorjanc commented 2 years ago

Thinking about the usefulness of linear algebra operations with tree sequence ... One of key operations in data analysis is solving the least squares problem. There are many flavours of solving the least squares problem. To estimate mutation effects (= allele substitution effects) by regressing phenotypes onto genotypes, I have been using Gauss-Seidel iterative method (see this nice "Technical Note: Computing Strategies in Genome-Wide Selection" where the Gauss-Seidel Residual Update variant is described) - here we are estimating all mutation effects jointly so that we can use them in genomic prediction, not one mutation at a time as mentioned above for GWAS. In Gauss-Seidel Residual update we need the following operations (I have a simple Fortran90 implementation here):

a. a dot product of a mutation row of G with itself dot(G[mutation,],G[mutation,]) (see here) b. a dot product of a mutation row of G with a samples vector r (dot(G[mutation,],r) (see here) c. a product of a mutation row of G with a scalar k (G[mutation,]*k) (see here) d. a product of sites vector l with a matrix G (l G = y) (see here)

gregorgorjanc commented 2 years ago

I have been thinking about various quantitative genetics models and tree sequence. In this thread the mentioned linear algebra tools involve the extended genotype matrix G, so all calculations involve sites and their mutations. Would it be possible to do the same “style” of operations but involving branches instead of sites&mutations? Say, in https://github.com/tskit-dev/tskit/issues/1547#issuecomment-1062744283 one operation was calculating “weighted mutation count”, which is y inl G = y, where l are mutation weights (=effects). I am thinking of calculating the “sum of branch weights”, that is, start at the top of the tree sequence and accumulate weights for each branch as we go down the trees (maybe samples-to-root direction works to?). This is related to general_stat but there only sample weights are accumulated (from samples towards roots).

jeromekelleher commented 2 years ago

Possibly - you'd need to explain the details to me in person though.

petrelharp commented 2 years ago

I think that yes, definitely! The "expected value of statistic S under infinite sites" is a good definition, and asked to anything we're considering.

petrelharp commented 10 months ago

On this topic, see also #2882.