vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.08k stars 191 forks source link

VG call calling odds heterozygotes #2683

Open RenzoTale88 opened 4 years ago

RenzoTale88 commented 4 years ago

Hi, I've recently started to align short reads to a arthropod-sized graph genome (~250 Mb) created through cactus, aligning two different assemblies. I've aligned the short reads for the males, which are haploids, to the graph genome, and performed, alignment, filtering, chunking (-c 50 -s 999999999 -o 100000), augmenting of each chunk and calling. I do the call using the command:

vg call ./CHUNK/$sample/${chunk}.aug.xg -r ./CHUNK/$sample/${chunk}.aug.snarls -s $sample -k ./CHUNK/$sample/${chunk}.aug.pack -t $NSLOTS -p $chroms -l $clength -o $bpi | vcf-sort | bgzip -c > ./VCALL/$sample/${chunk}.vcf.gz

Everything runs just fine, with no errors all along. The samples have a low depth of coverage (~10X), and are all haploid. In this situation, I would've expected the variants to be mostly homozygotes for ref or alt, with the heterozygotes to be mostly low depth sites. But when I look at the variants I can see that some of them are called heterozygotes despite having the same coverage and support for the different alleles of others called as homozygotes:

NW_0205  242     6740388_6740391 A       C       178.07  PASS    DP=12   GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:12:4,8:-19.252327,-1.922362,-9.938795:80:-1.098612:13.482759:4
NW_0205  262     6740396_6740399 T       G       203.279 PASS    DP=11   GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:11:2,9:-22.474669,-2.623996,-6.040050:34:-1.098996:13.482759:2
NW_0205  264     6740399_6740402 C       G       175.825 PASS    DP=11   GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:11:3,8:-19.252327,-2.146874,-8.068569:59:-1.098613:13.482759:3
NW_0205  266     6740402_6740405 A       G       203.279 PASS    DP=11   GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:11:2,9:-22.474669,-2.623996,-6.040050:34:-1.098996:13.482759:2
NW_0205  276     6740405_6740408 G       A       199.982 PASS    DP=10   GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:10:1,9:-22.458611,-2.953653,-4.375615:14:-1.135761:10.961379:1
NW_0205  279     6740408_6740411 TTT     AGC     222.709 PASS    DP=11   GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:11:1,10:-25.141949,-3.370725,-4.642868:12:-1.150673:8.440000:1
NW_0205  283     6740411_6740414 G       C       222.709 PASS    DP=11   GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:11:1,10:-25.141949,-3.370725,-4.642868:12:-1.150673:8.440000:1
NW_0205  284     6740414_6740416 TT      T       239.372 PASS    DP=11   GT:DP:AD:GL:GQ:GP:XD:MAD        1/1:11:1,10:-26.388223,-3.993862,-2.967147:10:-1.188484:8.440000:10

Is there some settings that I'm getting wrong? Thank you in advance Andrea

adamnovak commented 4 years ago

I think this is mostly expected behavior.

For the sites that are vaguely balanced, like the first one (4 reads ref, 8 reads alt), it makes sense to make a heterozygous call, because that's what the data suggests. Even for some of the unbalanced sites, you can argue for a het call, because at these low coverages the likelihood of being highly unbalanced can be more than the likelihood of a read error.

Those last two sites are definitely weird, though. Why should a SNP with 1 ref read and 10 alt reads be 0/1, while a deletion with 1 ref read and 10 alt reads is 1/1? Looking at the GLs, it clearly thinks homozygous alt (third number) is dramatically more consistent with the data (i.e. more likely, less negative) for the deletion than it is for the SNP, but it's not clear why. Base errors are more common than indel errors, so a single read supporting the reference should be more believable in the indel case. Maybe that's really a homopolymer, and the caller can tell that the alignments don't really distinguish between the possible lengths? Or maybe it's an artifact of how we integrate support for nodes and support for edges.

@glennhickey might be able to explain it.

glennhickey commented 4 years ago

I think we really need a way to specify haploid data to vg call. I will try to add an option this week (I think most of the code is there -- just need to expose it to the command line). These variants would certainly all be reported "1" by a haploid caller.

But for this, I agree with @adamnovak that the first 4 calls make sense. The remaining ones do seem a bit weird. I think part of the issue is that the AD fields reported in the VCF aren't exactly what was used to make the call due to rounding. Also, there are alleles that don't appear in the vcf that may come into play when nudging edges cases to one side and the other that that will affect the genotypes and GL values but otherwise be invisible in the VCF.

Still, I'm suspicious enough (having modified a lot of this logic recently) that if you're willing to share the data, I'll happily go through it to explain these results. Could be some hard coded parameters (baseline mapping error rate .5% , coverage window size 50) make more sense for the benchmarks I've used than this data. I can also use it as a test case for haploid calls.

RenzoTale88 commented 4 years ago

@glennhickey sure, I've just sent you the link through onedrive. The data are for one chromosome and one sample. There is the gam and the vg generated through vg chunk and the augmented data generated through vg augment, vg index -x, vg pack and vg snarls. All is done through vg 1.22.0.

adamnovak commented 4 years ago

Oh sorry, I missed that the organism is in fact supposed to be haploid. Our caller definitely assumes everything is diploid, and we don't currently have any machinery to enlighten it. @glennhickey if you want to add that that would be excellent! It would be great if we could support XY human karyotypes and maybe even tell it where the pseudo-autosomal region is (i.e. have a diploid region on an otherwise haploid contig).

glennhickey commented 4 years ago

OK, haploid calling is now available by setting ploidy to 1 (vg call -d 1). The calls with this option on this data should make much more sense.

Internally, the ploidy is a parameter when calling each snarl. So exact regional specification would be difficult to support. Having it set for all snarls that overlap a given pseudoautosomal region would be easy to add though.

The poisson model assigns a likelihood from the coverage of the alleles in the genotype, as well as the number of reads mapping to alleles not in the genotype. This second term is derived from an idea of the mapping error rate. At the moment, this value is hardcoded to something quite low .005, just because it was working out on a lot of the test data. But it is what's leading to cases like you're seeing where AD 1/9 is leading to 0/1 calls.

I'll try to see if there's something I can do to make this robust. The root problem is that the MAPQ information is lost from the GAM in vg pack.