Genotype call array to dosage

alimanfoo commented 4 years ago

Raising this issue to discuss implementing a function to convert a genotype call array to a dosage array.

In general, the output array should have at least two dimensions, with the first two dimensions being (variants, samples). The array elements give the dosage of each allele, i.e., how many copies of an allele are carried by the individual.

Some questions for discussion

How do we handle biallelic and multiallelic variants?
How do we handle missing genotype calls and avoid creating a bias towards a particular allele (e.g., the reference allele)?
What should the output dtype be? (int or float or either)? (Some programs think of dosage as a continuous variable.)

Some related functions (not necessarily what we want to copy, but for reference):

scikit-allel has to_n_ref, to_n_alt, to_allele_counts
skallel prototype has genotypes_3d_to_allele_counts, genotypes_3d_to_allele_counts_melt, genotypes_3d_to_major_allele_counts

alimanfoo commented 4 years ago

Perhaps this issue should be merged with #3?

tomwhite commented 4 years ago

In #38 I have created an implementation that

Is biallelic only
Represents missing calls with NaN in call/dosage and a corresponding mask in call/dosage_mask
Uses float types

hammer commented 4 years ago

Is biallelic only

I don't want to complicate things too early but as WES/WGS becomes more common than genotyping arrays, multiallelic sites become more common.

Should we strive for simplicity early (i.e. biallelic only) so that we can solve for other issues in library design now, or should we design for complexity up front (i.e. multiallelic sites) so that we don't have to revise things later?

I can see arguments for both approaches, so I'm happy to defer to the team's desires. I think @alimanfoo already works primarily with WGS data, so getting his insights would be valuable.

jeromekelleher commented 4 years ago

In general I think we need to assume everything is multiallelic, from the start. However, is it clear that we need to worry about this for dosages? This is a stupid question, but can we get dosage data for anything other than genotype arrays?

eric-czech commented 4 years ago

can we get dosage data for anything other than genotype arrays?

It's pretty common in GWAS to compute dosage based on genotype likelihood from WES/WGS for regression rather than hard calls, so that's one example at least.

jeromekelleher commented 4 years ago

OK, seems to me like we have to grasp the multi-allelic nettle from the start then.

alimanfoo commented 4 years ago

+1 for thinking multiallelic from the start, will make some entomologists very happy :-)

(And also +1 for thinking about any ploidy from the start - will hopefully make us some botanist friends :-))

alimanfoo commented 4 years ago

can we get dosage data for anything other than genotype arrays?

It's pretty common in GWAS to compute dosage based on genotype likelihood from WES/WGS for regression rather than hard calls, so that's one example at least.

I'm not an expert here, but I was wondering if we should consider separately these two different cases:

genotype call array -> genotype (per-call) allele counts array
genotype likelihoods array -> genotype dosage array

Genotype dosage is in general I believe modelled as a continuous variable, i.e., should have a float data type and can take any value between 0 and 2 (for diploid biallelic). I'm guessing genotype dosage is rarely calculated for multiallelic variants.

Perhaps this issue was badly named and/or should be split into two?

alimanfoo commented 4 years ago

Just to add, perhaps there are in fact three cases:

genotype call array -> genotype (per-call) allele counts array
genotype likelihoods array (from WES/WGS) -> genotype dosage array
genotype probabilities array (from genotyping chip + imputation) -> genotype dosage array

The last two cases may be structurally identical in terms of how we represent the data, but are conceptually slightly different and wanted to check these are all paths we want to cover?

alimanfoo commented 4 years ago

Also mentioning one other consideration, handling of missing data. This might not be the best place to discuss, but I'm conscious that converting genotype calls to some kind of allele dosage-like representation can induce a bias in the presence of missing data.

This may not ever come up when dealing with data from chip + imputation because there are no missing calls. But will come up when dealing with WES/WGS data.

The obvious thing to avoid is creating a bias towards the reference allele. E.g., if you assume biallelic sites, and you compute dosage of the alternate allele, then both a hom ref call and a missing call end up with the same dosage (0).

alimanfoo commented 4 years ago

Represents missing calls with NaN in call/dosage and a corresponding mask in call/dosage_mask

Should've read this before I made previous comment :-)

alimanfoo commented 4 years ago

Represents missing calls with NaN in call/dosage and a corresponding mask in call/dosage_mask

One question I will have at some point, is how do you deal with missing calls in PCA? A related question, how do you deal with multiallelic variants in PCA? That probably deserves it's own issue, but lodging the thought here for now.

eric-czech commented 3 years ago

A related question, how do you deal with multiallelic variants in PCA?

BTW the most common approach I see for dealing with multiallelic sites when starting from WES/WGS for PCA, kinship estimation, LD pruning, association tests, etc. is to split the sites as new biallelic variants and recode the calls/dosages. Is that something you've used before @alimanfoo? I was assuming we could support more alleles in a lot of analyses with something like a split_multiallelic_variants function that would make a dataset compatible with any biallelic-only functions.

timothymillar commented 3 years ago

Just to add a relevant (but very niche) example to the discussion: I'm currently working on Bayesian assembly of micro-haplotypes from polyploid NGS data. The output of this is a VCF in which each row/variant is a variable 'chunk' of the genome (~100bp) which are highly multi-allelic (often up to 20+ unique alleles). In addition to a genotype call, it also produces an 'expected dosage' which is a continuous value in the range [0, ploidy] for each unique allele in the genotype. The 'expected dosage' is calculated by weighting each possible integer dosage by it's posterior probability.

In short: +1 for polyploid, multi-allelic, continuous dosage (sorry)

hammer commented 3 years ago

@alimanfoo may want to do PCA w/ multi-allelic data (cf. #95) that could exercise the more complex representation.

Another path forward: enumerate the specific subtasks that need to be resolved for specific methods.

eric-czech commented 3 years ago

Sorry @alimanfoo I misunderstood you today when you were asking about using dosages in GWAS. I tend to think of them as dosage = fn(gentoype probabilities) and alt allele counts = fn(hard calls) but I was definitely mistaken in saying that we don't frequently use dosages if you consider alt allele counts to be the same thing.

I think this right on, or it matches how I think of these:

I was wondering if we should consider separately these two different cases: genotype call array -> genotype (per-call) allele counts array genotype likelihoods array -> genotype dosage array

Some thoughts on this:

I had assumed that both of these are equally applicable to the biallelic and multiallelic cases when you split multiallelic variants first. That seems to be what @timothymillar is describing, if I'm following that correctly.
I think that conceptual fork in a 3rd case (chip + imputation) you mentioned @alimanfoo is there, but afaik handling both would work with the same code
I now better understand what you're getting with reference bias on missing calls, and the answer to that in the workflows I've seen is to drop them and mean impute the values later (e.g. before PCA or regression). Typically it's a very small fraction of calls that need it after all the other QC in human data so it seems to be an accepted practice.

What do you guys think of this as a plan for moving forward:

We move the discussion about alt allele counting to https://github.com/pystatgen/sgkit/issues/85 (sorry, I had created this without remembering this issue)
- That function should handle polyploidy, multiple alleles, and missing values
- Polyploidy and missing values are pretty easy to handle IMO, assuming you can buy in to imputing missing values later
- Multi-allelic handling could be tricker but we could perhaps continue discussing specifics on that other issue
  - The approach I've seen to this is to count any non-reference allele and sum them up so which alt allele is called becomes irrelevant. Users can split the alleles first if they want separate dosages for each allele -- I'm not sure if we want that behavior or not but it would be a common thing to see in other GWAS software.
We maybe make an issue for the genotype likelihoods array -> genotype dosage array case? I'd love to be able to do this instead in our GWAS, but we already have it from bgen and I'm not worried about calculating it from WES/WGS for a bit.
We don't plan on doing anything special for the genotype probabilities array (from genotyping chip + imputation) -> genotype dosage array case. I think we can drop that one.

timothymillar commented 3 years ago

Sorry I think I have confused the nomenclature here. One of the harder problems in (auto-)polyploids is that the increased chromosome copy-number results in a high rate of heterozygous calls, but these calls are not all equal. For example the genotypes AAAB, AABB and ABBB are all heterozygous but not equivalent and often we can determine the unique alleles but not the count of each allele. In the polyploid literature these counts are often referred to as the 'allelic dosage' but this idea clearly maps to 'genotype (per-call) allele counts array' i.e. an integer array of shape (n_variants, n_samples, n_alleles). What I thought what you were calling a 'dosage array' was an equivalent shaped array (n_variants, n_samples, n_alleles) but with floats.

The dosage array that you are describing (n_variants, n_samples) has been used with biallelic SNPs for GWAS in polyploids under the name "continuous genotype values". As for splitting multiallelic variants into biallelic variants, this should work the same regardless of ploidy though I'm unsure of the implications it would have for down-stream analysis (I have limited knowledge of GWAS).

eric-czech commented 3 years ago

Thanks @timothymillar, makes sense. I do think of a dosage array with that definition but often ignore the alleles dimension so hopefully we're all circling around the same things now. Also that reference is great, thanks for that!

In general, I think of multi-allelic considerations as being between the Scylla of losing information on which variants are at the same locus (in the splitting / one-hot encoding case) and the Charybdis of trusting that any statistic that generalizes beyond the bi-allelic case can accurately reflect what impact those alternate alleles have relative to one another (few to none do afaik). I'll probably keep referring to splitting alleles as the solution to dealing with multiple of them because the latter case is much more important to me in our GWAS efforts. In other words, I'd gladly split the alleles rather than use some dosage calculation that equally weights two alternate alleles in a coding region where one is synonymous and the other results in a complete loss of function.

I'm assuming not splitting alleles will be more important in popgen stats (e.g. https://github.com/pystatgen/sgkit/issues/94). My hope is that we can generalize as many functions as possible to assume multiple alleles in support of those and then add flags or examples that show how to do the same in the bi-allelic case without losing much efficiency.

sgkit-dev / sgkit

Genotype call array to dosage #21