Open alimanfoo opened 4 years ago
Perhaps this issue should be merged with #3?
In #38 I have created an implementation that
call/dosage
and a corresponding mask in call/dosage_mask
Is biallelic only
I don't want to complicate things too early but as WES/WGS becomes more common than genotyping arrays, multiallelic sites become more common.
Should we strive for simplicity early (i.e. biallelic only) so that we can solve for other issues in library design now, or should we design for complexity up front (i.e. multiallelic sites) so that we don't have to revise things later?
I can see arguments for both approaches, so I'm happy to defer to the team's desires. I think @alimanfoo already works primarily with WGS data, so getting his insights would be valuable.
In general I think we need to assume everything is multiallelic, from the start. However, is it clear that we need to worry about this for dosages? This is a stupid question, but can we get dosage data for anything other than genotype arrays?
can we get dosage data for anything other than genotype arrays?
It's pretty common in GWAS to compute dosage based on genotype likelihood from WES/WGS for regression rather than hard calls, so that's one example at least.
OK, seems to me like we have to grasp the multi-allelic nettle from the start then.
+1 for thinking multiallelic from the start, will make some entomologists very happy :-)
(And also +1 for thinking about any ploidy from the start - will hopefully make us some botanist friends :-))
can we get dosage data for anything other than genotype arrays?
It's pretty common in GWAS to compute dosage based on genotype likelihood from WES/WGS for regression rather than hard calls, so that's one example at least.
I'm not an expert here, but I was wondering if we should consider separately these two different cases:
Genotype dosage is in general I believe modelled as a continuous variable, i.e., should have a float data type and can take any value between 0 and 2 (for diploid biallelic). I'm guessing genotype dosage is rarely calculated for multiallelic variants.
Perhaps this issue was badly named and/or should be split into two?
Just to add, perhaps there are in fact three cases:
The last two cases may be structurally identical in terms of how we represent the data, but are conceptually slightly different and wanted to check these are all paths we want to cover?
Also mentioning one other consideration, handling of missing data. This might not be the best place to discuss, but I'm conscious that converting genotype calls to some kind of allele dosage-like representation can induce a bias in the presence of missing data.
This may not ever come up when dealing with data from chip + imputation because there are no missing calls. But will come up when dealing with WES/WGS data.
The obvious thing to avoid is creating a bias towards the reference allele. E.g., if you assume biallelic sites, and you compute dosage of the alternate allele, then both a hom ref call and a missing call end up with the same dosage (0).
- Represents missing calls with NaN in
call/dosage
and a corresponding mask incall/dosage_mask
Should've read this before I made previous comment :-)
- Represents missing calls with NaN in
call/dosage
and a corresponding mask incall/dosage_mask
One question I will have at some point, is how do you deal with missing calls in PCA? A related question, how do you deal with multiallelic variants in PCA? That probably deserves it's own issue, but lodging the thought here for now.
A related question, how do you deal with multiallelic variants in PCA?
BTW the most common approach I see for dealing with multiallelic sites when starting from WES/WGS for PCA, kinship estimation, LD pruning, association tests, etc. is to split the sites as new biallelic variants and recode the calls/dosages. Is that something you've used before @alimanfoo? I was assuming we could support more alleles in a lot of analyses with something like a split_multiallelic_variants
function that would make a dataset compatible with any biallelic-only functions.
Just to add a relevant (but very niche) example to the discussion: I'm currently working on Bayesian assembly of micro-haplotypes from polyploid NGS data. The output of this is a VCF in which each row/variant is a variable 'chunk' of the genome (~100bp) which are highly multi-allelic (often up to 20+ unique alleles). In addition to a genotype call, it also produces an 'expected dosage' which is a continuous value in the range [0, ploidy] for each unique allele in the genotype. The 'expected dosage' is calculated by weighting each possible integer dosage by it's posterior probability.
In short: +1 for polyploid, multi-allelic, continuous dosage (sorry)
@alimanfoo may want to do PCA w/ multi-allelic data (cf. #95) that could exercise the more complex representation.
Another path forward: enumerate the specific subtasks that need to be resolved for specific methods.
Sorry @alimanfoo I misunderstood you today when you were asking about using dosages in GWAS. I tend to think of them as dosage = fn(gentoype probabilities) and alt allele counts = fn(hard calls) but I was definitely mistaken in saying that we don't frequently use dosages if you consider alt allele counts to be the same thing.
I think this right on, or it matches how I think of these:
I was wondering if we should consider separately these two different cases: genotype call array -> genotype (per-call) allele counts array genotype likelihoods array -> genotype dosage array
Some thoughts on this:
What do you guys think of this as a plan for moving forward:
genotype likelihoods array -> genotype dosage array
case? I'd love to be able to do this instead in our GWAS, but we already have it from bgen and I'm not worried about calculating it from WES/WGS for a bit.genotype probabilities array (from genotyping chip + imputation) -> genotype dosage array
case. I think we can drop that one.Sorry I think I have confused the nomenclature here. One of the harder problems in (auto-)polyploids is that the increased chromosome copy-number results in a high rate of heterozygous calls, but these calls are not all equal. For example the genotypes AAAB, AABB and ABBB are all heterozygous but not equivalent and often we can determine the unique alleles but not the count of each allele. In the polyploid literature these counts are often referred to as the 'allelic dosage' but this idea clearly maps to 'genotype (per-call) allele counts array' i.e. an integer array of shape (n_variants, n_samples, n_alleles). What I thought what you were calling a 'dosage array' was an equivalent shaped array (n_variants, n_samples, n_alleles) but with floats.
The dosage array that you are describing (n_variants, n_samples) has been used with biallelic SNPs for GWAS in polyploids under the name "continuous genotype values". As for splitting multiallelic variants into biallelic variants, this should work the same regardless of ploidy though I'm unsure of the implications it would have for down-stream analysis (I have limited knowledge of GWAS).
Thanks @timothymillar, makes sense. I do think of a dosage array with that definition but often ignore the alleles dimension so hopefully we're all circling around the same things now. Also that reference is great, thanks for that!
In general, I think of multi-allelic considerations as being between the Scylla of losing information on which variants are at the same locus (in the splitting / one-hot encoding case) and the Charybdis of trusting that any statistic that generalizes beyond the bi-allelic case can accurately reflect what impact those alternate alleles have relative to one another (few to none do afaik). I'll probably keep referring to splitting alleles as the solution to dealing with multiple of them because the latter case is much more important to me in our GWAS efforts. In other words, I'd gladly split the alleles rather than use some dosage calculation that equally weights two alternate alleles in a coding region where one is synonymous and the other results in a complete loss of function.
I'm assuming not splitting alleles will be more important in popgen stats (e.g. https://github.com/pystatgen/sgkit/issues/94). My hope is that we can generalize as many functions as possible to assume multiple alleles in support of those and then add flags or examples that show how to do the same in the bi-allelic case without losing much efficiency.
Raising this issue to discuss implementing a function to convert a genotype call array to a dosage array.
In general, the output array should have at least two dimensions, with the first two dimensions being (variants, samples). The array elements give the dosage of each allele, i.e., how many copies of an allele are carried by the individual.
Some questions for discussion
Some related functions (not necessarily what we want to copy, but for reference):