Open jeromekelleher opened 6 months ago
See discussion from @petrelharp here: https://github.com/tskit-dev/tskit/pull/2823#discussion_r1413346530
I think the answer is probably something to do with computing totals or average in divmat, but I'm not really clear either way.
This gist is the basis of the divmat implementation I developed
I've been digging into this a bit to try and understand what's going on. According to the discussion here https://github.com/tskit-dev/tskit/pull/2823/files#r1413346530, for genetic divergence we have that $diversity(I,J)$ is the average over $diversity(i,j)$. For ts.divergence_matrix(), this appears to hold for $I \neq J$, but when we have $I = J$. it seems that $diversity(I,I) is twice the average:
import msprime
import numpy as np
# Simulate a small tree sequence
n_ind = 5
ts = msprime.simulate(sample_size=2 * 5, Ne=1000, length=1e4)
# Compute the divergence matrix
D_s = ts.divergence_matrix(mode="branch")
# Compute the haploid matrix from individuals
A = np.zeros((n_ind, 2*n_ind))
for j in range(n_ind):
A[j, 2*j] = A[j, 2*j + 1] = 1
ind_sets = [(2*i, 2*i + 1) for i in range(n_ind)]
D_i = ts.divergence_matrix(sample_sets = ind_sets, mode = "branch")
D_i_alt = A @ D_s @ A.T / 4 # average over samples
print(D_i / D_i_alt)
[[2. 1. 1. 1. 1.]
[1. 2. 1. 1. 1.]
[1. 1. 2. 1. 1.]
[1. 1. 1. 2. 1.]
[1. 1. 1. 1. 2.]]
I'm not familiar enough with the (implementation of) divergence_matrix to understand if this is to be expected...
If we use D_i_alt
as the divergence matrix from which we calculate the genetic relatedness matrix, then this agrees with the GRM computed from ts.genetic_relatedness()
(up to a constant)
Full code:
import sys
import pandas as pd
import tskit
import msprime
import numpy as np
# Simulate a small tree sequence
n_ind = 5
ts = msprime.simulate(sample_size=2 * 5, Ne=1000, length=1e4)
# Compute the divergence matrix
D_s = ts.divergence_matrix(mode="branch")
# Compute the haploid matrix from individuals
A = np.zeros((n_ind, 2*n_ind))
for j in range(n_ind):
A[j, 2*j] = A[j, 2*j + 1] = 1
ind_sets = [(2*i, 2*i + 1) for i in range(n_ind)]
D_i = ts.divergence_matrix(sample_sets = ind_sets, mode = "branch")
D_i_alt = A @ D_s @ A.T / 4
print(D_i / D_i_alt)
G_s = ts.genetic_relatedness_matrix(mode = "branch")
y_s = np.mean(D_s, axis = 0)
y_i = np.mean(D_i, axis = 0)
y_i_alt = np.mean(D_i_alt, axis = 0)
D_mean_s = np.mean(D_s)
D_mean_i = np.mean(D_i)
D_mean_i_alt = np.mean(D_i_alt)
G_div_s = (y_s[:, np.newaxis] + y_s[np.newaxis, :] - D_s - D_mean_s) * 0.5
np.isclose(G_div_s, G_s).all()
# True
# Compute the genetic relatedness matrix from individuals
G_i = A @ G_s @ A.T
G_i_alt = np.zeros((n_ind, n_ind))
for i in range(n_ind):
for j in range(n_ind):
G_i_alt[i, j] = ts.genetic_relatedness(sample_sets = ind_sets, indexes = (i, j), mode = "branch", proportion=False)
G_div_i_alt = (y_i_alt[:, np.newaxis] + y_i_alt[np.newaxis, :] - D_i_alt - D_mean_i_alt) * 0.5
G_div_i = (y_i[:, np.newaxis] + y_i[np.newaxis, :] - D_i - D_mean_i) * 0.5
np.isclose(G_div_i_alt, G_i / 4).all()
# True
Okay - the behavior observed above is expected, since recall that diversity and divergence are defined "without replacement", e.g. (docs for diversity)
Mean pairwise genetic diversity: the average over all n choose 2 pairs of sample nodes, of the density of sites at which the two carry different alleles, per unit of chromosome length.
and also (docstring for divergence_matrix)
# Finds the mean divergence between pairs of samples from each set of
# samples and in each window. Returns a numpy array indexed by (window,
# sample_set, sample_set). Diagonal entries are corrected so that the
# value gives the mean divergence for *distinct* samples, but it is not
# checked whether the sample_sets are disjoint (so offdiagonals are not
# corrected).
So, the diagonal of D_i_alt
as computed above should be multiplied by 2 since it's diploids; more generally the i
th diagonal it should be multiplied by n[i]/(n[i]-1)
if n[i]
is the size of the i
th sample set.
I'll figure out what's going on in the code and tidy things up.
Okay, this is officially confusing. As someone pointed out, the issue is coming from the fact that we've defined genetic_relatedness
to be the sum (rather than the mean) over the nodes in each sample set.
Summarizing: in the code for genetic_relatedness
we have
for (k = 0; k < state_dim; k++) {
sumx += x[k];
sumn += (double) args.sample_set_sizes[k];
}
meanx = sumx / sumn;
// ...
result[k] = (x[i] - ni * meanx) * (x[j] - nj * meanx) / 2;
which perhaps doesn't contradict but at least doesn't agree with the docs: or the docstring:
Total area of branches in the window ancestral to pairs of samples in two sample sets relative to the rest of the sample sets. To be precise, let B(u,v) denote the total area of all branches ancestral to nodes u and v, and let B(I,J) be the sum of B(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be two independently chosen sample sets. Then for sample sets I and J, this computes E[B(I,J) - B(I,S) - B(J,T) + B(S,T)].
I'm not sure but I think this is three different definitions? For instance, in the code we take the average uniformly across all samples; whereas in the docstring we say it's over all sample sets.
Now the relationship we're trying to use here to compute relatedness from divergence is this: let $Ti$ be the total distance from sample i up to the root; then if $D{ij}$ is the divergence between i and j and $R_{ij}$ is the relatedness between i and j, then
$$T_i + Tj = D{ij} + 2 R{ij}.$$
So, for any samples I, J, S, T:
$$R{IJ} - R{IS} - R{JT} + R{ST} = (D{IJ} - D{IS} - D{JT} + D_{ST}) / (-2) .$$
Since it's true for every I, J, S and T, this is still true if I, J, S, and T are random variables (and they need not be independent). So, this still applies when I=J are two draws without replacement rom the same sample set.
If genetic_relatedness
was defined as an average over sample sets, then we'd be all set, and could just use this relationship. However, it's defined as the sum over the members of the sample sets, contrary to all the other statistics.
This is making me reconsider our decision to define it in that way (as the sum over the sample sets). Next I need to go look at our justification for that.
I don't really have an opinion on how the stat should be defined, but I think it's fine to change change older definitions if we like - they haven't been around for that long, and I doubt many people are using them.
On slack, @gregorgorjanc says:
"relationship/relatedness/kinship/..." is reported either as a covariance coefficient (in that case we consider all genome copies of individuals) or as a probability of identity (in that case we consider random draws of genome copies of individuals). Often authors aren't clear what they report :(
So if it's a covariance, the issue is 'covariance of what' - clearly, we want the covariance of individual's traits, so this comes down to what our model is for traits. If X_ij
is the sum of effects on the j-th chromosome of individual i, then two options are that individual i's trait is either (a) the sum of the X_ij
; or (b) the average of the X_ij
. If everyone has the same ploidy, it doesn't really matter - for instance,
cov[(X_i1 + X_i2)/2, (X_j1 + X_j2)/2] = cov[X_i1 + X_i2, X_j1 + X_j2]/4
so, to convert between (a) and (b) we just multiply by 4 (for diploids). However, if there's some haploids and some diploids then (a) and (b) are not quite so simply related. Still, I suppose that if the ploidy of individual i is p_i then you can convert between them by multiplying the ij-th entry by p_i * p_j
. It's seeming to me that it would seem more natural to quantitative folks to have the trait be the sum of the two chromosomes rather than the average, so using the average will introduce some confusing factors of 2. However, I think it's more confusing that relatedness doesn't follow the convention for sample_sets that the other statistics follows - following the other statistics, we'd expect e.g. ts.genetic_relatedness([ts.samples(population=1), ts.samples(population=2)])
to give the average relatedness between a sample from population 1 and a sample from population 2. However, currently it'd give the sum of their relatednesses.
@petrelharp yes, quantitative genetic models are based on the sum of effects across genome copies. As you say we, averaging could also work if this is well documented so people know how to use the outcome - multiply by 4 for diploids (this might be a bit confusing at first because QG folk might be used to multiplying by 2 when going from probability of identity (often termed as kinship) to covariance, but documenting this well should work - happy to work with you on the docs to make it clear).
Will write another post on haplo-diploid model
Let's explore the haplo-diploid case here to see how this is managed in a recent publication
Theoretical and empirical comparisons of expected and realized relationships for the X-chromosome https://gsejournal.biomedcentral.com/articles/10.1186/s12711-020-00570-6
Some background:
Fernando and Grossman [16] derived rules to estimate a pedigree-based genetic relationship matrix (S)
for the X-chromosome. For a biallelic locus (e.g., A/B), males carry a single copy (coming from their dam)
whereas females carry two copies (coming from their sire and dam). In their model, allelic effects were
identical in males and females. This corresponds to defining the effects of loci in terms of expected
differences across female descendants, but also to absence of dosage compensation (see definition below).
Nevertheless, this relationship matrix can be rescaled to account for different levels of dosage compensation.
As a result, it can be used for all traits and genetic architectures. Defining effects in female descendants is
convenient as these represent an adequate reference population to define additive genotypic values and
receive gametes from both their sire and dam. In addition, Fernando and Grossman [16] assumed
no imprinting—this means that females receiving alleles A and B (from the sire and dam, respectively)
will have the same genotypic value than females receiving alleles B and A (from the sire and dam, respectively).
In this paper, we followed their hypotheses, including the absence of imprinting. By doing this, genomic and
pedigree-based relationships will be compatible and coherent.
Coding of haploid and diploid genotypes
For a biallelic locus on the X-chromosome, numerical coding as gene content may proceed as {0,1} for males
(say, for genotypes {A,B}) and {0,1,2} for females (say, for genotypes {AA,AB,BB}). This coding is consistent
with the theory of Fernando and Grossman [16], and corresponds to the number of biological copies in males
(that are hemizygotes).
Substitution effect and variance
Matrix of genos
Genomic relationships for the X-chromosome
There is more in there, including a section on Treatment of dosage compensation and sex heterogeneity in traits expressed in both sexes.
@petrelharp does any of the above help in deciding on sum vs average for haplo-diploids?
Notes, discussing with @gregorgorjanc:
"method"=["sum", "average"]
Also:
In #2823 we added support for the genetic_relatedness_matrix function as a follow-up usage of
divergence_matrix
using sample sets. This doesn't work when the sample_sets have a length > 1. It's not clear what the problem is, but it's something to do with the conversion of the divergence matrix into the GRM via the normalisation function (currently kludged to observationally work with single-sample sets)