malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Standardise biallelic diplotypes and handling of missing calls #471

Open alimanfoo opened 8 months ago

alimanfoo commented 8 months ago

We now have a function biallelic_diplotypes() which computes an array of shape (variants, samples) representing genotypes at biallelic sites.

The current implementation codes each call as an integer counting the count of the second allele. Missing calls do not receive any special coding, and so get a count of 0, which is the same value for a genotype homozygous for the first allele. This isn't ideal as it will cause a bias towards the reference allele in general, because this is most commonly the first allele. This bias can be minimised, by using the max_missing_an parameter to select only sites with low or zero missingness. However, in general it would be better to have missing calls represented in the data, and then explicitly dealt with in any functions that use the data.

Current functions downstream of biallelic_diplotypes() are pca() and plot_njt() (via biallelic_pairwise_distances()). Neither of these functions have any special handling of missing calls, and so would need to be adapted.

A planned function to be added in the next release is biallelic_snp_calls_to_plink() to write out data in plink binary format (#248). The convention for these data is to represent diplotypes as the count of the first allele (rather than the second) and to code missing calls as -127.

To allow for consistency across all functions downstream of biallelic_diplotypes() I propose to:

This would allow a user to run pca(), plot_njt() and biallelic_snp_calls_to_plink() with the same parameters and know that exactly the same data had been used.