Sample_sets incorrect in genetic_relatedness_matrix

jeromekelleher commented 6 months ago

In #2823 we added support for the genetic_relatedness_matrix function as a follow-up usage of divergence_matrix using sample sets. This doesn't work when the sample_sets have a length > 1. It's not clear what the problem is, but it's something to do with the conversion of the divergence matrix into the GRM via the normalisation function (currently kludged to observationally work with single-sample sets)

jeromekelleher commented 6 months ago

See discussion from @petrelharp here: https://github.com/tskit-dev/tskit/pull/2823#discussion_r1413346530

jeromekelleher commented 6 months ago

I think the answer is probably something to do with computing totals or average in divmat, but I'm not really clear either way.

jeromekelleher commented 6 months ago

This gist is the basis of the divmat implementation I developed

brieuclehmann commented 4 months ago

I've been digging into this a bit to try and understand what's going on. According to the discussion here https://github.com/tskit-dev/tskit/pull/2823/files#r1413346530, for genetic divergence we have that $diversity(I,J)$ is the average over $diversity(i,j)$. For ts.divergence_matrix(), this appears to hold for $I \neq J$, but when we have $I = J$. it seems that $diversity(I,I) is twice the average:

import msprime
import numpy as np

# Simulate a small tree sequence
n_ind = 5
ts = msprime.simulate(sample_size=2 * 5, Ne=1000, length=1e4)

# Compute the divergence matrix
D_s = ts.divergence_matrix(mode="branch")

# Compute the haploid matrix from individuals
A = np.zeros((n_ind, 2*n_ind))
for j in range(n_ind):
    A[j, 2*j] = A[j, 2*j + 1] = 1
ind_sets = [(2*i, 2*i + 1) for i in range(n_ind)]

D_i = ts.divergence_matrix(sample_sets = ind_sets, mode = "branch")
D_i_alt = A @ D_s @ A.T / 4  # average over samples

print(D_i / D_i_alt)
[[2. 1. 1. 1. 1.]
 [1. 2. 1. 1. 1.]
 [1. 1. 2. 1. 1.]
 [1. 1. 1. 2. 1.]
 [1. 1. 1. 1. 2.]]

brieuclehmann commented 4 months ago

I'm not familiar enough with the (implementation of) divergence_matrix to understand if this is to be expected...

brieuclehmann commented 4 months ago

If we use D_i_alt as the divergence matrix from which we calculate the genetic relatedness matrix, then this agrees with the GRM computed from ts.genetic_relatedness() (up to a constant)

brieuclehmann commented 4 months ago

Full code:

import sys
import pandas as pd
import tskit
import msprime
import numpy as np

# Simulate a small tree sequence
n_ind = 5
ts = msprime.simulate(sample_size=2 * 5, Ne=1000, length=1e4)

# Compute the divergence matrix
D_s = ts.divergence_matrix(mode="branch")

# Compute the haploid matrix from individuals
A = np.zeros((n_ind, 2*n_ind))
for j in range(n_ind):
    A[j, 2*j] = A[j, 2*j + 1] = 1
ind_sets = [(2*i, 2*i + 1) for i in range(n_ind)]

D_i = ts.divergence_matrix(sample_sets = ind_sets, mode = "branch")
D_i_alt = A @ D_s @ A.T / 4
print(D_i / D_i_alt)

G_s = ts.genetic_relatedness_matrix(mode = "branch")

y_s = np.mean(D_s, axis = 0)
y_i = np.mean(D_i, axis = 0)
y_i_alt = np.mean(D_i_alt, axis = 0)

D_mean_s = np.mean(D_s)
D_mean_i = np.mean(D_i)
D_mean_i_alt = np.mean(D_i_alt)

G_div_s = (y_s[:, np.newaxis] + y_s[np.newaxis, :] - D_s - D_mean_s) * 0.5
np.isclose(G_div_s, G_s).all()
# True

# Compute the genetic relatedness matrix from individuals
G_i = A @ G_s @ A.T
G_i_alt = np.zeros((n_ind, n_ind))
for i in range(n_ind):
    for j in range(n_ind):
        G_i_alt[i, j] = ts.genetic_relatedness(sample_sets = ind_sets, indexes = (i, j), mode = "branch", proportion=False)

G_div_i_alt = (y_i_alt[:, np.newaxis] + y_i_alt[np.newaxis, :] - D_i_alt - D_mean_i_alt) * 0.5
G_div_i = (y_i[:, np.newaxis] + y_i[np.newaxis, :] - D_i - D_mean_i) * 0.5

np.isclose(G_div_i_alt, G_i / 4).all()
# True

petrelharp commented 2 months ago

Okay - the behavior observed above is expected, since recall that diversity and divergence are defined "without replacement", e.g. (docs for diversity)

Mean pairwise genetic diversity: the average over all n choose 2 pairs of sample nodes, of the density of sites at which the two carry different alleles, per unit of chromosome length.

and also (docstring for divergence_matrix)

    #    Finds the mean divergence  between pairs of samples from each set of
    #     samples and in each window. Returns a numpy array indexed by (window,
    #     sample_set, sample_set).  Diagonal entries are corrected so that the
    #     value gives the mean divergence for *distinct* samples, but it is not
    #     checked whether the sample_sets are disjoint (so offdiagonals are not
    #     corrected).

So, the diagonal of D_i_alt as computed above should be multiplied by 2 since it's diploids; more generally the ith diagonal it should be multiplied by n[i]/(n[i]-1) if n[i] is the size of the ith sample set.

I'll figure out what's going on in the code and tidy things up.

petrelharp commented 2 months ago

Okay, this is officially confusing. As someone pointed out, the issue is coming from the fact that we've defined genetic_relatedness to be the sum (rather than the mean) over the nodes in each sample set.

Summarizing: in the code for genetic_relatedness we have

    for (k = 0; k < state_dim; k++) {
        sumx += x[k];
        sumn += (double) args.sample_set_sizes[k];
    }
    meanx = sumx / sumn;
    // ...
        result[k] = (x[i] - ni * meanx) * (x[j] - nj * meanx) / 2;

which perhaps doesn't contradict but at least doesn't agree with the docs: Screenshot from 2024-05-13 21-51-09 or the docstring:

Total area of branches in the window ancestral to pairs of samples in two sample sets relative to the rest of the sample sets. To be precise, let B(u,v) denote the total area of all branches ancestral to nodes u and v, and let B(I,J) be the sum of B(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be two independently chosen sample sets. Then for sample sets I and J, this computes E[B(I,J) - B(I,S) - B(J,T) + B(S,T)].

I'm not sure but I think this is three different definitions? For instance, in the code we take the average uniformly across all samples; whereas in the docstring we say it's over all sample sets.

Now the relationship we're trying to use here to compute relatedness from divergence is this: let $Ti$ be the total distance from sample i up to the root; then if $D{ij}$ is the divergence between i and j and $R_{ij}$ is the relatedness between i and j, then $$T_i + Tj = D{ij} + 2 R{ij}.$$ So, for any samples I, J, S, T: $$R{IJ} - R{IS} - R{JT} + R{ST} = (D{IJ} - D{IS} - D{JT} + D_{ST}) / (-2) .$$ Since it's true for every I, J, S and T, this is still true if I, J, S, and T are random variables (and they need not be independent). So, this still applies when I=J are two draws without replacement rom the same sample set. If genetic_relatedness was defined as an average over sample sets, then we'd be all set, and could just use this relationship. However, it's defined as the sum over the members of the sample sets, contrary to all the other statistics.

This is making me reconsider our decision to define it in that way (as the sum over the sample sets). Next I need to go look at our justification for that.

jeromekelleher commented 2 months ago

I don't really have an opinion on how the stat should be defined, but I think it's fine to change change older definitions if we like - they haven't been around for that long, and I doubt many people are using them.

petrelharp commented 2 months ago

On slack, @gregorgorjanc says:

"relationship/relatedness/kinship/..." is reported either as a covariance coefficient (in that case we consider all genome copies of individuals) or as a probability of identity (in that case we consider random draws of genome copies of individuals). Often authors aren't clear what they report :(

So if it's a covariance, the issue is 'covariance of what' - clearly, we want the covariance of individual's traits, so this comes down to what our model is for traits. If X_ij is the sum of effects on the j-th chromosome of individual i, then two options are that individual i's trait is either (a) the sum of the X_ij; or (b) the average of the X_ij. If everyone has the same ploidy, it doesn't really matter - for instance,

cov[(X_i1 + X_i2)/2, (X_j1 + X_j2)/2] = cov[X_i1 + X_i2, X_j1 + X_j2]/4

so, to convert between (a) and (b) we just multiply by 4 (for diploids). However, if there's some haploids and some diploids then (a) and (b) are not quite so simply related. Still, I suppose that if the ploidy of individual i is p_i then you can convert between them by multiplying the ij-th entry by p_i * p_j. It's seeming to me that it would seem more natural to quantitative folks to have the trait be the sum of the two chromosomes rather than the average, so using the average will introduce some confusing factors of 2. However, I think it's more confusing that relatedness doesn't follow the convention for sample_sets that the other statistics follows - following the other statistics, we'd expect e.g. ts.genetic_relatedness([ts.samples(population=1), ts.samples(population=2)]) to give the average relatedness between a sample from population 1 and a sample from population 2. However, currently it'd give the sum of their relatednesses.

gregorgorjanc commented 2 months ago

@petrelharp yes, quantitative genetic models are based on the sum of effects across genome copies. As you say we, averaging could also work if this is well documented so people know how to use the outcome - multiply by 4 for diploids (this might be a bit confusing at first because QG folk might be used to multiplying by 2 when going from probability of identity (often termed as kinship) to covariance, but documenting this well should work - happy to work with you on the docs to make it clear).

Will write another post on haplo-diploid model

gregorgorjanc commented 2 months ago

Let's explore the haplo-diploid case here to see how this is managed in a recent publication

Theoretical and empirical comparisons of expected and realized relationships for the X-chromosome https://gsejournal.biomedcentral.com/articles/10.1186/s12711-020-00570-6

Some background:

Fernando and Grossman [16] derived rules to estimate a pedigree-based genetic relationship matrix (S)
for the X-chromosome. For a biallelic locus (e.g., A/B), males carry a single copy (coming from their dam)
whereas females carry two copies (coming from their sire and dam). In their model, allelic effects were
identical in males and females. This corresponds to defining the effects of loci in terms of expected
differences across female descendants, but also to absence of dosage compensation (see definition below).

Nevertheless, this relationship matrix can be rescaled to account for different levels of dosage compensation.
As a result, it can be used for all traits and genetic architectures. Defining effects in female descendants is
convenient as these represent an adequate reference population to define additive genotypic values and
receive gametes from both their sire and dam. In addition, Fernando and Grossman [16] assumed
no imprinting—this means that females receiving alleles A and B (from the sire and dam, respectively)
will have the same genotypic value than females receiving alleles B and A (from the sire and dam, respectively).
In this paper, we followed their hypotheses, including the absence of imprinting. By doing this, genomic and
pedigree-based relationships will be compatible and coherent.

Coding of haploid and diploid genotypes

For a biallelic locus on the X-chromosome, numerical coding as gene content may proceed as {0,1} for males
(say, for genotypes {A,B}) and {0,1,2} for females (say, for genotypes {AA,AB,BB}). This coding is consistent
with the theory of Fernando and Grossman [16], and corresponds to the number of biological copies in males
(that are hemizygotes).

Substitution effect and variance

Matrix of genos

Genomic relationships for the X-chromosome

There is more in there, including a section on Treatment of dosage compensation and sex heterogeneity in traits expressed in both sexes.

@petrelharp does any of the above help in deciding on sum vs average for haplo-diploids?

petrelharp commented 2 months ago

Notes, discussing with @gregorgorjanc:

one could compute the covariance between individual traits, which would depend on the model of dosage compensation, or between offspring traits (e.g., as above in females, to standardize ploidy)
in the text copied above, G^X takes the "sum" approach
"group coancestry" refers to the average kinship in a group, so is like the 'average' method
we could add an additional argument to toggle between "sum" and "average", perhaps "method"=["sum", "average"]

Also:

usually one defines coancestry, etcetera sampling with replacement (see Cockerham 1967)
note that for the covariance matrix of additive traits, we also want the self-comparisons
so, by default at least we should include the self comparisons (and maybe have an option to not)

tskit-dev / tskit

Sample_sets incorrect in genetic_relatedness_matrix #2888