sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
224 stars 32 forks source link

Add function to convert variant dataset to biallelic #627

Open jeromekelleher opened 3 years ago

jeromekelleher commented 3 years ago

A very common pattern in genetic analysis is to assume that there is exactly two alleles at each variant site. Datasets are then forced to be biallelic by conversion. I guess there's a few ways to do this:

  1. Throw away any sites that are not biallelic (probably no point in us implementing this because it's pretty simple to do?)
  2. Convert each site so that there are two alleles. We could keep the two most common alleles and treat everything else as the missing data, but there's probably other ways to do it too.
  3. Split the alleles to create more variants

Plink has the --biallelic-only option

Hail has the split_multi_hts method

@eric-czech, any thoughts here?

eric-czech commented 3 years ago

I think of option 3 as being the preferable solution for sure. I say that because, in human cohorts at least, multiallelic sites used to be more like 2% of all sites (so ignoring them was more common AFAICT) but are more like 10% as sample sizes have grown to 10k+, and some studies have shown disease associations with them [1]. Structural variants also hit allele counts of 100 or 1000+ in 1k Genomes [2] and not that you'd probably want to analyze them all individually, but it would be nice if they weren't truncated when loaded in so that someone could decide how to encode or group them later.

That said, I don' t think the majority of statgen users would complain if we only supported biallelic sites. We did already have a request for it though: https://github.com/pystatgen/sgkit/issues/541.

timothymillar commented 2 years ago

Somewhat related, this is a simple function to get the frequency of each alternate allele in "long form" with shape (variant loci * alts at loci, samples). This works for any ploidy or alt number. It also avoids needing to add pseudo-reference allele counts. Though I'm not sure if it suits your use cases.

def long_form_alt_frequencies(call_allele_frequency, variant_allele):
    keep = da.array(variant_allele != b'').compute()
    keep.T[0] = False
    af = da.moveaxis(da.asarray(call_allele_frequency), 1, 2)
    v, a, s = af.shape
    return af.reshape(v * a, s)[keep.ravel()]