tskit-dev / tskit


Calculate squared correlation to assess imputation performance #2200

Open szhan opened 2 years ago

szhan commented 2 years ago

Another metric is the squared correlation (SR), which is simply the square of the Pearson correlation coefficient between the allele dosage (AD) of the true genotypes and the AD of the imputed genotypes. AD is the number of copies of the alternate allele an individual carries: in a diploid genome, the AD of 0|0 is 0, the AD of 0|1 and 1|0 is 1, and the AD of 1|1 is 2. SR is pertinent to GWAS because it has been shown that a higher mean SR across sites can mean higher power to discover variants associated with a trait or disease. Linking #2193.
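As a rough illustration of the definition above, here is a minimal NumPy sketch. The helper names `allele_dosage` and `squared_correlation` are made up for this example and are not part of tskit, and the input is assumed to be biallelic 0/1 haplotype pairs with no missing data.

```python
import numpy as np


def allele_dosage(haplotypes):
    """Sum the two haplotypes of each diploid individual.

    `haplotypes` is assumed to have shape (n_individuals, 2) with 0/1 alleles,
    so 0|0 -> 0, 0|1 and 1|0 -> 1, 1|1 -> 2.
    """
    return np.asarray(haplotypes).sum(axis=1)


def squared_correlation(true_dosage, imputed_dosage):
    """Square of Pearson's r between two dosage vectors."""
    # np.corrcoef returns a 2x2 correlation matrix; [0, 1] is r itself.
    r = np.corrcoef(true_dosage, imputed_dosage)[0, 1]
    return r ** 2


# Toy data for one site: five diploid individuals, one imputation error.
true_haps = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0]])
imputed_haps = np.array([[0, 0], [0, 1], [0, 0], [1, 1], [0, 0]])

true_ad = allele_dosage(true_haps)
imputed_ad = allele_dosage(imputed_haps)
sr = squared_correlation(true_ad, imputed_ad)
print(true_ad, imputed_ad, sr)
```

Note that if either dosage vector has zero variance (for example, a site that is monomorphic in the imputed data), `np.corrcoef` yields `nan`, so such sites would need separate handling.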

A function in the Variant class that allows us to get SR site by site would be good.

site_sq_corr = []
for variant1, variant2 in zip(ts_true.variants(), ts_imputed.variants()):
  sq_corr = variant1.squared_correlation(variant2)
  site_sq_corr.append(sq_corr)

jeromekelleher commented 2 years ago

This is so simple that maybe what we want is just a method to return the dosage instead? So, we'd do something like

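import numpy as np

site_sq_corr = []
for variant1, variant2 in zip(ts_true.variants(), ts_imputed.variants()):
  # dosage() is the method proposed here; it does not exist yet.
  # np.corrcoef returns a 2x2 matrix, so take the off-diagonal entry (Pearson's r).
  sq_corr = np.corrcoef(variant1.dosage(), variant2.dosage())[0, 1] ** 2
  site_sq_corr.append(sq_corr)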
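Until such a method exists, the dosage could be derived from a flat per-sample genotype array (such as the one `Variant.genotypes` provides). The sketch below is only an illustration under stated assumptions: `dosage_from_genotypes` is a hypothetical helper, sites are biallelic with 0/1 alleles and no missing data, and each consecutive pair of sample genomes is assumed to belong to one diploid individual. Plain NumPy arrays stand in for the two tree sequences.

```python
import numpy as np


def dosage_from_genotypes(genotypes, ploidy=2):
    """Collapse a flat array of per-genome alleles into per-individual dosages.

    Assumes genomes 0..ploidy-1 belong to individual 0, the next `ploidy`
    genomes to individual 1, and so on (an assumption, not a tskit guarantee).
    """
    g = np.asarray(genotypes)
    if g.size % ploidy != 0:
        raise ValueError("number of sample genomes must be a multiple of ploidy")
    return g.reshape(-1, ploidy).sum(axis=1)


# Toy stand-ins for the genotypes at two sites (rows) across eight genomes.
true_sites = np.array([[0, 0, 0, 1, 1, 0, 1, 1],
                       [0, 1, 0, 0, 1, 1, 0, 0]])
imputed_sites = np.array([[0, 0, 0, 1, 1, 1, 1, 1],
                          [0, 1, 0, 0, 1, 1, 0, 1]])

site_sq_corr = []
for g_true, g_imputed in zip(true_sites, imputed_sites):
    r = np.corrcoef(dosage_from_genotypes(g_true),
                    dosage_from_genotypes(g_imputed))[0, 1]
    site_sq_corr.append(r ** 2)
print(site_sq_corr)
```

With real tree sequences, the rows would come from iterating over the variants of `ts_true` and `ts_imputed` in parallel, as in the loop above.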