Biallelic SNP calls

alimanfoo commented 9 months ago

Several popgen analyses either require biallelic SNPs or are easier to implement with them. E.g., our current pca() function locates biallelic SNPs before running the PCA. Since several other functions will likely need to locate biallelic SNPs in the same way, I propose pulling this step out into its own function.

Proposed signature:

def snp_calls_biallelic(
    region,
    sample_sets,
    sample_query,
    site_mask,
    max_missing_an,
    min_minor_ac,
    n_snps,
    thin_offset,
) -> xr.Dataset:

The max_missing_an, min_minor_ac, n_snps and thin_offset parameters would have the same semantics as parameters in the current pca() function.
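The issue does not spell out those semantics, so the following is only a sketch under stated assumptions: that max_missing_an is the maximum number of missing allele calls allowed at a site, min_minor_ac is the minimum minor allele count, n_snps is an approximate target number of SNPs, and thin_offset is the index of the first SNP kept when thinning at a regular step. The helper name `select_sites` and its `ac` and `total_an` inputs are hypothetical and are not part of the proposed signature.

```python
import numpy as np


def select_sites(ac, total_an, max_missing_an, min_minor_ac, n_snps, thin_offset=0):
    """Hypothetical helper: return indices of sites passing the filters, then thinned.

    ac       : (n_variants, 2) allele counts at biallelic sites.
    total_an : number of allele calls per site if nothing were missing
               (n_samples * ploidy).
    """
    ac = np.asarray(ac)
    # Allele calls that are missing at each site (assumed meaning of max_missing_an).
    an_missing = total_an - ac.sum(axis=1)
    # Count of the rarer of the two alleles (assumed meaning of min_minor_ac).
    minor_ac = ac.min(axis=1)
    keep = (an_missing <= max_missing_an) & (minor_ac >= min_minor_ac)
    indices = np.nonzero(keep)[0]
    # Thin at a regular step so that roughly n_snps sites remain (assumed scheme).
    if indices.size > n_snps:
        step = indices.size // n_snps
        indices = indices[thin_offset::step]
    return indices
```

Under these assumptions, changing thin_offset selects a different, interleaved subset of sites, which is one way to check that a result does not depend on the particular SNPs chosen.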

The return value would be an xarray Dataset with the following variables:

sample_id
variant_contig
variant_position
variant_allele
call_genotype

The data would be transformed to provide the simplest representation of biallelic SNPs, so the variant_allele array would have 2 columns, and the maximum value in the call_genotype array would be 1.
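To illustrate the transformation described above, here is a minimal NumPy-only sketch. The helper name `to_biallelic`, the array layout (variants x samples x ploidy, with -1 for missing calls) and the rule that the lower-indexed observed allele is recoded as 0 are assumptions for illustration; they are not taken from the draft implementation.

```python
import numpy as np


def to_biallelic(genotype, alleles):
    """Hypothetical helper: keep sites with exactly two observed alleles, recode to 0/1.

    genotype : (n_variants, n_samples, ploidy) integer allele indices, -1 = missing.
    alleles  : (n_variants, n_alleles) array of allele strings.
    """
    n_alleles = alleles.shape[1]
    # Per-site count of each allele index; missing calls (-1) match nothing.
    ac = np.stack(
        [(genotype == a).sum(axis=(1, 2)) for a in range(n_alleles)], axis=1
    )
    observed = ac > 0
    is_biallelic = observed.sum(axis=1) == 2

    gt = genotype[is_biallelic]
    al = alleles[is_biallelic]
    obs = observed[is_biallelic]

    # Original indices of the two observed alleles, in ascending order.
    # A stable sort on ~obs puts observed alleles (False) first.
    idx = np.argsort(~obs, axis=1, kind="stable")[:, :2]
    new_alleles = np.take_along_axis(al, idx, axis=1)

    # Per-site lookup table: original allele index -> 0 or 1 (-1 if unobserved).
    mapping = np.full(obs.shape, -1, dtype=np.int8)
    np.put_along_axis(mapping, idx, np.array([0, 1], dtype=np.int8), axis=1)

    rows = np.arange(gt.shape[0])[:, None, None]
    # Missing calls stay -1; all other calls are looked up in the mapping.
    new_gt = np.where(gt >= 0, mapping[rows, gt], -1).astype(np.int8)
    return new_gt, new_alleles
```

After this recoding the allele array has 2 columns and no genotype value exceeds 1, matching the representation proposed above. Wrapping the results in an xr.Dataset is omitted here.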

alimanfoo commented 9 months ago

Just to add, this would be particularly useful for population structure analyses: PCA, neighbour-joining trees, nearest-neighbour graphs and admixture could all depend on this same function to generate their input data.

alimanfoo commented 9 months ago

Initial draft implementation here.

malariagen / malariagen-data-python

Biallelic SNP calls #448