alimanfoo closed this issue 9 months ago
Just to add, this is particularly useful for population structure analyses, e.g., PCA, neighbour-joining trees, nearest-neighbour graphs and admixture could all depend on this same function for generating input data.
Several popgen analyses either require or are easier to implement with biallelic SNPs only. E.g., our current pca() function locates biallelic SNPs before running the PCA. Since several other functions will likely need to locate biallelic SNPs in the same way, I propose to pull this out into a separate function.
Proposed signature:
The max_missing_an, min_minor_ac, n_snps and thin_offset parameters would have the same semantics as the corresponding parameters in the current pca() function.
The return value would be an xarray Dataset with the following variables:
The data would be transformed to provide the simplest representation of biallelic SNPs, so the variant_allele array would have 2 columns, and the maximum value in the call_genotype array would be 1.
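To make the transformation concrete, here is a rough NumPy sketch of one way it could work. It is not the proposed implementation: the function name `to_biallelic_sketch`, the plain-array inputs and the convention that -1 marks a missing call are all assumptions for illustration, and "biallelic" is taken to mean that exactly two alleles are observed at a site.

```python
import numpy as np


def to_biallelic_sketch(variant_allele, call_genotype):
    """Hypothetical helper: keep biallelic sites and recode genotypes to 0/1.

    variant_allele : (variants, alleles) array of allele labels
    call_genotype  : (variants, samples, ploidy) int array, -1 = missing (assumed)
    """
    alleles = np.asarray(variant_allele)
    gt = np.asarray(call_genotype)
    n_alleles = alleles.shape[1]

    # Count how often each allele index is observed at each site.
    ac = np.stack(
        [(gt == a).sum(axis=(1, 2)) for a in range(n_alleles)], axis=1
    )

    # Assumed definition: a site is biallelic if exactly two alleles are observed.
    loc_biallelic = (ac > 0).sum(axis=1) == 2
    alleles, gt, ac = alleles[loc_biallelic], gt[loc_biallelic], ac[loc_biallelic]
    n_variants, n_samples, ploidy = gt.shape

    # Indices of the two observed alleles per site, in their original order.
    # A stable argsort on "is unobserved" puts observed alleles first.
    observed = np.argsort(ac == 0, axis=1, kind="stable")[:, :2]

    # Per-site lookup table: old allele index -> new index (0 or 1).
    mapping = np.full((n_variants, n_alleles), -1, dtype=gt.dtype)
    rows = np.arange(n_variants)
    mapping[rows, observed[:, 0]] = 0
    mapping[rows, observed[:, 1]] = 1

    # Recode genotype calls, leaving missing calls as -1.
    flat = gt.reshape(n_variants, n_samples * ploidy)
    missing = flat < 0
    recoded = np.take_along_axis(mapping, np.where(missing, 0, flat), axis=1)
    recoded[missing] = -1

    # Reduce the allele array to the two observed alleles.
    alleles_2col = np.take_along_axis(alleles, observed, axis=1)
    return loc_biallelic, alleles_2col, recoded.reshape(gt.shape)
```

The returned boolean mask could also be used to subset any other per-variant arrays (e.g., positions) consistently. The filtering and thinning parameters are deliberately left out of this sketch.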