macarthur-lab / gnomad_browser

gnomAD browser pre-ASHG 2018
MIT License
33 stars 16 forks source link

Inconsistency with AB_HIST_ALT and some multiallelic sites? #84

Closed koelling closed 6 years ago

koelling commented 6 years ago

I encountered a weird issue that I wanted to check with you regarding this site: http://gnomad.broadinstitute.org/variant/7-74306893-A-G

This site is reported as multiallelic (A/G and A/T) on the website, with A/G found in many exomes/genomes and A/T only having an AC of 1 in exomes. Consequently, only the A/G variant is reported in the genome VCF (gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vcf.bgz).

However, when I look at the allele balance histogram, it seems to give the data for the other ALT allele, with only a single data point (implying AC = 1). The same is true for AB_HIST_ALL.

AB_HIST_ALT=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1

Is it possible that there is some sort of strange bug with multiallelics, where you might end up extracting the wrong value for AB_HIST_ALT if one of the alt alleles is removed from the final output VCF? I would imagine something like always taking the values from the first alt allele in the VCF, even if the kept allele was the second one.

Of course, another question would be why this site is included in the first place (since it isn't annotated as coding), but that's not a big deal I think.

Thank you!

konradjk commented 6 years ago

You found an interesting case, but I don't think it's a bug. The AB_HISTs only include samples that were called as heterozygous, and since this is an AF ~ 1 variant, nearly all the samples were homozygous.

In that case, there's a separate question of why there's a heterozygote at all, but there was one whose genotype was filtered out (on the basis of GQ, DP, or AB), which you can see in the *_raw annotations:

GC_raw=0,1,15453;AC_raw=30907;AN_raw=30908

The histograms include these samples that were filtered out (hence why you have some histograms with samples at DP < 10, GQ < 20, or AB < 0.2).

As to why the site was included at all, that's a good question: it must have been in one of the intervals files we used but I'm not sure why.

koelling commented 6 years ago

Oh I see, that makes a lot of sense. Thank you!

So just to confirm, sum(AB_HIST_ALT) should always be the same as the unfiltered number of Hets as given in the second item in GC_raw?

And how does this field work for multi-allelic sites? It looks like the GC field contains 6 values for triallelic sites, which are presumably all possible combinations of ref, alt1 and alt2. I think the order is 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 but I would like to make sure I am interpreting this correctly. In particular, I am unsure where 1/2 goes in this list.

konradjk commented 6 years ago

So just to confirm, sum(AB_HIST_ALT) should always be the same as the unfiltered number of Hets as given in the second item in GC_raw?

Yes, I believe that's correct, though I haven't thought through if there are any edge cases.

And how does this field work for multi-allelic sites? It looks like the GC field contains 6 values for triallelic sites, which are presumably all possible combinations of ref, alt1 and alt2. I think the order is 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 but I would like to make sure I am interpreting this correctly. In particular, I am unsure where 1/2 goes in this list.

Yes, that's correct. This corresponds to a Number=G array, which is described (albeit a bit hidden) in the VCF spec:

If A is the allele in
REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by:
F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the
ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
koelling commented 6 years ago

Perfect, thank you very much!