sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
221 stars 32 forks source link

Genotype call array to dosage #21

Open alimanfoo opened 4 years ago

alimanfoo commented 4 years ago

Raising this issue to discuss implementing a function to convert a genotype call array to a dosage array.

In general, the output array should have at least two dimensions, with the first two dimensions being (variants, samples). The array elements give the dosage of each allele, i.e., how many copies of an allele are carried by the individual.

Some questions for discussion

Some related functions (not necessarily what we want to copy, but for reference):

alimanfoo commented 4 years ago

Perhaps this issue should be merged with #3?

tomwhite commented 4 years ago

In #38 I have created an implementation that

hammer commented 4 years ago

Is biallelic only

I don't want to complicate things too early but as WES/WGS becomes more common than genotyping arrays, multiallelic sites become more common.

Should we strive for simplicity early (i.e. biallelic only) so that we can solve for other issues in library design now, or should we design for complexity up front (i.e. multiallelic sites) so that we don't have to revise things later?

I can see arguments for both approaches, so I'm happy to defer to the team's desires. I think @alimanfoo already works primarily with WGS data, so getting his insights would be valuable.

jeromekelleher commented 4 years ago

In general I think we need to assume everything is multiallelic, from the start. However, is it clear that we need to worry about this for dosages? This is a stupid question, but can we get dosage data for anything other than genotype arrays?

eric-czech commented 4 years ago

can we get dosage data for anything other than genotype arrays?

It's pretty common in GWAS to compute dosage based on genotype likelihood from WES/WGS for regression rather than hard calls, so that's one example at least.

jeromekelleher commented 4 years ago

OK, seems to me like we have to grasp the multi-allelic nettle from the start then.

alimanfoo commented 4 years ago

+1 for thinking multiallelic from the start, will make some entomologists very happy :-)

(And also +1 for thinking about any ploidy from the start - will hopefully make us some botanist friends :-))

alimanfoo commented 4 years ago

can we get dosage data for anything other than genotype arrays?

It's pretty common in GWAS to compute dosage based on genotype likelihood from WES/WGS for regression rather than hard calls, so that's one example at least.

I'm not an expert here, but I was wondering if we should consider separately these two different cases:

Genotype dosage is in general I believe modelled as a continuous variable, i.e., should have a float data type and can take any value between 0 and 2 (for diploid biallelic). I'm guessing genotype dosage is rarely calculated for multiallelic variants.

Perhaps this issue was badly named and/or should be split into two?

alimanfoo commented 4 years ago

Just to add, perhaps there are in fact three cases:

The last two cases may be structurally identical in terms of how we represent the data, but are conceptually slightly different and wanted to check these are all paths we want to cover?

alimanfoo commented 4 years ago

Also mentioning one other consideration, handling of missing data. This might not be the best place to discuss, but I'm conscious that converting genotype calls to some kind of allele dosage-like representation can induce a bias in the presence of missing data.

This may not ever come up when dealing with data from chip + imputation because there are no missing calls. But will come up when dealing with WES/WGS data.

The obvious thing to avoid is creating a bias towards the reference allele. E.g., if you assume biallelic sites, and you compute dosage of the alternate allele, then both a hom ref call and a missing call end up with the same dosage (0).

alimanfoo commented 4 years ago
  • Represents missing calls with NaN in call/dosage and a corresponding mask in call/dosage_mask

Should've read this before I made previous comment :-)

alimanfoo commented 4 years ago
  • Represents missing calls with NaN in call/dosage and a corresponding mask in call/dosage_mask

One question I will have at some point, is how do you deal with missing calls in PCA? A related question, how do you deal with multiallelic variants in PCA? That probably deserves it's own issue, but lodging the thought here for now.

eric-czech commented 3 years ago

A related question, how do you deal with multiallelic variants in PCA?

BTW the most common approach I see for dealing with multiallelic sites when starting from WES/WGS for PCA, kinship estimation, LD pruning, association tests, etc. is to split the sites as new biallelic variants and recode the calls/dosages. Is that something you've used before @alimanfoo? I was assuming we could support more alleles in a lot of analyses with something like a split_multiallelic_variants function that would make a dataset compatible with any biallelic-only functions.

timothymillar commented 3 years ago

Just to add a relevant (but very niche) example to the discussion: I'm currently working on Bayesian assembly of micro-haplotypes from polyploid NGS data. The output of this is a VCF in which each row/variant is a variable 'chunk' of the genome (~100bp) which are highly multi-allelic (often up to 20+ unique alleles). In addition to a genotype call, it also produces an 'expected dosage' which is a continuous value in the range [0, ploidy] for each unique allele in the genotype. The 'expected dosage' is calculated by weighting each possible integer dosage by it's posterior probability.

In short: +1 for polyploid, multi-allelic, continuous dosage (sorry)

hammer commented 3 years ago

@alimanfoo may want to do PCA w/ multi-allelic data (cf. #95) that could exercise the more complex representation.

Another path forward: enumerate the specific subtasks that need to be resolved for specific methods.

eric-czech commented 3 years ago

Sorry @alimanfoo I misunderstood you today when you were asking about using dosages in GWAS. I tend to think of them as dosage = fn(gentoype probabilities) and alt allele counts = fn(hard calls) but I was definitely mistaken in saying that we don't frequently use dosages if you consider alt allele counts to be the same thing.

I think this right on, or it matches how I think of these:

I was wondering if we should consider separately these two different cases: genotype call array -> genotype (per-call) allele counts array genotype likelihoods array -> genotype dosage array

Some thoughts on this:

What do you guys think of this as a plan for moving forward:

timothymillar commented 3 years ago

Sorry I think I have confused the nomenclature here. One of the harder problems in (auto-)polyploids is that the increased chromosome copy-number results in a high rate of heterozygous calls, but these calls are not all equal. For example the genotypes AAAB, AABB and ABBB are all heterozygous but not equivalent and often we can determine the unique alleles but not the count of each allele. In the polyploid literature these counts are often referred to as the 'allelic dosage' but this idea clearly maps to 'genotype (per-call) allele counts array' i.e. an integer array of shape (n_variants, n_samples, n_alleles). What I thought what you were calling a 'dosage array' was an equivalent shaped array (n_variants, n_samples, n_alleles) but with floats.

The dosage array that you are describing (n_variants, n_samples) has been used with biallelic SNPs for GWAS in polyploids under the name "continuous genotype values". As for splitting multiallelic variants into biallelic variants, this should work the same regardless of ploidy though I'm unsure of the implications it would have for down-stream analysis (I have limited knowledge of GWAS).

eric-czech commented 3 years ago

Thanks @timothymillar, makes sense. I do think of a dosage array with that definition but often ignore the alleles dimension so hopefully we're all circling around the same things now. Also that reference is great, thanks for that!

In general, I think of multi-allelic considerations as being between the Scylla of losing information on which variants are at the same locus (in the splitting / one-hot encoding case) and the Charybdis of trusting that any statistic that generalizes beyond the bi-allelic case can accurately reflect what impact those alternate alleles have relative to one another (few to none do afaik). I'll probably keep referring to splitting alleles as the solution to dealing with multiple of them because the latter case is much more important to me in our GWAS efforts. In other words, I'd gladly split the alleles rather than use some dosage calculation that equally weights two alternate alleles in a coding region where one is synonymous and the other results in a complete loss of function.

I'm assuming not splitting alleles will be more important in popgen stats (e.g. https://github.com/pystatgen/sgkit/issues/94). My hope is that we can generalize as many functions as possible to assume multiple alleles in support of those and then add flags or examples that show how to do the same in the bi-allelic case without losing much efficiency.