tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Calculate concordance to assess imputation performance #2199

Open szhan opened 2 years ago

szhan commented 2 years ago

We want to implement functions in the Variant class to calculate various metrics to evaluate genotype imputation performance (discussed #2193).

The simplest metric is the overall concordance, which is the proportion of genotypes imputed correctly. Some studies look at the non-reference concordance, which considers only the genotypes that contain a non-reference allele. I am thinking that we can implement a function that allows us to compute the concordance between the same sites in two tree sequences. ts_true contains the known genotypes (e.g., from msprime), and ts_imputed contains the imputed genotypes.

site_concordance = []
for variant1, variant2 in zip(ts_true.variants(), ts_imputed.variants()):
     concordance = variant1.concordance(variant2, non_reference_only = False).
     site_concordance.append(concordance)

If the option non_reference_only is set to True, then it is returning the non-reference concordance instead of the overall concordance.

jeromekelleher commented 2 years ago

LGTM. We'd have to think a bit about how to specify the reference allele (we don't usually store it, it's the ancestral state we think about).

A good first pass would be to implement it for all values though.

Is this really a "rate" though for the method name? Maybe just something like

def concordance(self, other):
     """
     Return the fraction of sample for which the genotypes are equal.
     """
     if other.genotypes.shape != self.genotypes.shape:
          raise ValueError("Variants not comparable")
     n = self.genotypes.shape[0]
     if self.alleles == other.alleles:
         # Quick path:
         return np.sum(self.genotypes == other.genotypes) / n
     else:
         # Slow path - remap the alleles so the genotypes are comparable.
         ...
benjeffery commented 2 years ago

Just chatted to @szhan about this to confirm this will belong on the Variant class next to genotypes et al. Currently all of this class's methods are C based - we can move this to C later if it seems wise, but I suspect we won't get much benefit if as it looks numpy heavy.

petrelharp commented 2 years ago

I think this is missing something - won't the tree sequence contain all the samples, including the reference panel? And you only want the concordance rate on the ones that have been imputed?

I'm also not totally sure this needs to be a method, since can't you just do

site_concordance_rates = []
for variant1, variant2 in zip(ts_true.variants(), ts_imputed.variants()):
     rate = np.sum(variant1.genotypes == variant2.genotypes) / n
     site_concordance_rates.append(rate)

(I can't think of a situation where you'd actually need to do the "slow path", ie the allele remapping?)

jeromekelleher commented 2 years ago

I can imagine the alleles being different all right, if you have two different tsinfered trees with say a multiallelic site where parsimony has been used. The order of the alleles can easily get shifted then. So implementing the slow-path properly and not having to worry about that is the main value of the method to me.

petrelharp commented 2 years ago

I'm not following here - suppose you've got alleles at a bunch of samples at a multiallelic site, say A, G, and T, and you want to impute alleles at some other sites; when is it the case you'd want to check for concordance-under-remapping and not straight concordance? I'm missing something, I assume.

jeromekelleher commented 2 years ago

So, suppose ts1 = msprime.sim(...) and it contains a bunch of multiallelic sites. We run tsinfer on this to get ts2, and the alleles get shuffled. We want to verify the variants in one are the same as the variants in the other (let's forget about imputation for now). Something like this method would be handy for that alone (we do it a lot in the tsinfer test suite).

petrelharp commented 2 years ago

Oh! I see, sorry - the alleles are the same, but the order of them in the alleles array is different. (I was imagining comparing alleles, not indexes into the alleles array.)