Standardise biallelic diplotypes and handling of missing calls

We now have a function biallelic_diplotypes() which computes an array of shape (variants, samples) representing genotypes at biallelic sites.

The current implementation codes each call as an integer counting the count of the second allele. Missing calls do not receive any special coding, and so get a count of 0, which is the same value for a genotype homozygous for the first allele. This isn't ideal as it will cause a bias towards the reference allele in general, because this is most commonly the first allele. This bias can be minimised, by using the max_missing_an parameter to select only sites with low or zero missingness. However, in general it would be better to have missing calls represented in the data, and then explicitly dealt with in any functions that use the data.

Current functions downstream of biallelic_diplotypes() are pca() and plot_njt() (via biallelic_pairwise_distances()). Neither of these functions have any special handling of missing calls, and so would need to be adapted.

A planned function to be added in the next release is biallelic_snp_calls_to_plink() to write out data in plink binary format (#248). The convention for these data is to represent diplotypes as the count of the first allele (rather than the second) and to code missing calls as -127.

To allow for consistency across all functions downstream of biallelic_diplotypes() I propose to:

Modify biallelic_diplotypes() to count the first allele, rather than the second, and code missing calls as -127.
Modify the implementation of biallelic_pairwise_distances() to explicitly deal with missing calls. For each pair of diplotypes, this could be done by computing the mean distance at sites with non-missing calls in both diplotypes.
Modify the implementation of pca() to add a simple imputation of missing values if max_missing_an > 0, e.g., mean or most common.

This would allow a user to run pca(), plot_njt() and biallelic_snp_calls_to_plink() with the same parameters and know that exactly the same data had been used.

malariagen / malariagen-data-python

Standardise biallelic diplotypes and handling of missing calls #471