samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

mpileup --indels-cns and varying indel sizes #2117

Closed jkbonfield closed 4 months ago

jkbonfield commented 4 months ago

This is a fundamental problem for all calling modes, I believe, but it's particularly bad with --indels-cns because that mode uses edlib to compute an edit distance, which has no affine gap cost and charges a uniform +1 per base for mismatch, insertion and deletion.

Consider this set of alignments.

[Image: pileup screenshot showing REF (AAC), 2bp-deletion (A**) and 3bp-deletion (***) reads]

We have REF (AAC), 2bp del (A**) and 3bp del (***). The 3bp del arrives first, and once it meets the specified minimum number of reads to trigger an evaluation, we check the gap starting at 3220060. This gives a pair of consensus sequences for type -3 and type 0. (I think perhaps it should have a type -2 in there too, but that gap starts downstream, so in this specific column it just has the two.)

We then align all the records against these two candidates, and all the A** reads align better to *** than to AAC, so we end up attributing them as low-quality confirmations of the 3bp deletion, even though only 2 reads actually have a 3bp deletion!

chr20:3220060 AAAC A 0 // 21,18 
chr20:3220061 AAC A 0 // 20,19 
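Under plain unit-cost edit distance the misattribution falls out directly. A minimal sketch (the TT flanking bases are hypothetical; only the AAC / A / deletion structure comes from the example above):

```python
def edit_dist(a, b, indel_cost=1, mismatch_cost=1):
    # Standard DP edit distance with configurable per-base costs.
    # With both costs at 1 this mimics edlib's uniform scoring.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel_cost]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch_cost)
            cur.append(min(sub, prev[j] + indel_cost, cur[-1] + indel_cost))
        prev = cur
    return prev[-1]

read  = "TTATT"    # a 2bp-deletion read: AAC -> A, with hypothetical TT flanks
cons0 = "TTAACTT"  # type 0 consensus (REF allele, AAC)
cons3 = "TTTT"     # type -3 consensus (***)

# Uniform unit costs: the 2bp-del read is distance 1 from the 3bp deletion
# but distance 2 from REF, so it gets counted as weak support for the
# wrong allele.
print(edit_dist(read, cons0), edit_dist(read, cons3))  # -> 2 1
```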

If I reprocess the traceback produced by edlib, add another 10 per indel, and also cap the maximum edit distance, then aligning * to A and to AAC both end up with the same (capped) score and the alignment is rejected as not confirming either allele. That's a hacky way to solve this, but it works (with caveats about loss of evidence elsewhere):

chr20:3220060 AAAC A 0 // 17,6 
chr20:3220061 AAC A 0 // 17,13 

Still not perfect, but enough to reject it in call.
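The re-scoring hack can be sketched like so. The +10 indel penalty is the number quoted above; the cap value is illustrative, not the one actually used:

```python
INDEL_PENALTY = 10  # extra cost per indel base, per the hack described above
CAP = 10            # illustrative cap on the total edit distance

def rescore(n_mismatch, n_indel_bases, penalty=INDEL_PENALTY, cap=CAP):
    # Re-weight an edlib traceback: +1 per mismatch, +(1 + penalty) per
    # indel base, with the total capped.
    return min(n_mismatch + n_indel_bases * (1 + penalty), cap)

# The 2bp-deletion read against the two candidates:
vs_del3 = rescore(0, 1)  # one residual indel base vs the *** consensus
vs_ref  = rescore(0, 2)  # two indel bases vs the AAC consensus

# Both hit the cap, so the read confirms neither allele and is rejected.
assert vs_del3 == vs_ref == CAP
```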

Arguably we need the investigation at 3220060 to call 3 possible alleles: type -3, type -2 (1bp downstream) and type 0. I'm not sure if that's how indels-2.0 works (I'm guessing so, as it gave AD 38,2 for 3220060), or whether it's simply the BAQ scoring that knocked the confidences down and made the alignments score 0. I fear that simply rejecting alignments which contain any additional indel (which is essentially what my scoring hack did) would be hugely detrimental to PacBio, ONT, Ultima, etc.

I'm not sure how we'd compute types based on downstream indels within the same STR, but it's probably the way we need to go for robustness.

Edit: then again, if we had the third allele then we might get this, which would be confusing (and wrong, IMO):

chr20:3220060 AAAC AA
chr20:3220061 AAC  A

I've no idea if they'd be normalised away, but we shouldn't be calling something which isn't at that location. Really we need a third allele to soak up alignments, but not to be callable. So we evaluate against *, A* and AAC, allocate alignments to the 3 possibilities, and then emit scores only for the callable candidates.
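The soak-up idea can be sketched as follows. This is hypothetical, not bcftools code; and because the three candidates here are nested deletions of AAC, a simple length difference stands in for the real alignment distance:

```python
# Candidate alleles at this column; "A*" starts downstream, so it should
# absorb reads without ever being emitted as a call here.
candidates = {"*": "", "A*": "A", "AAC": "AAC"}
callable_at_pos = {"*", "AAC"}

def dist(read, allele):
    # Toy metric: for nested deletions of AAC, the length difference
    # equals the edit distance.
    return abs(len(read) - len(allele))

reads = ["AAC", "AAC", "A", "A", "", ""]  # toy pileup: 2x REF, 2x 2bp del, 2x 3bp del
ad = {k: 0 for k in candidates}
for r in reads:
    ad[min(candidates, key=lambda k: dist(r, candidates[k]))] += 1

# Only callable alleles reach the output; the A* reads no longer inflate
# the 3bp-deletion depth.
emitted = {k: v for k, v in ad.items() if k in callable_at_pos}
print(emitted)  # -> {'*': 2, 'AAC': 2}
```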

Did you do some nearby deletion detection or position normalisation step in your indels-2.0 Petr? How is it not tripping up on this case?

jkbonfield commented 4 months ago

So I see one way of avoiding this without preemptive filtering based on downstream indels within an STR (which is complex and liable to other mistakes).

We compute two consensus alleles for each "type": 2 for type -3 and 2 for type 0, assuming there's a difference. Type -3 only has one choice, as it occurs in just 2 reads and they match. Type 0 potentially has 2 alternatives, one with AAACAAAC and one with AAAC. If it picks both of these, then the AAAAC reads confirm type 0 and the type -3 becomes AD 34,2. If it doesn't, the AA**AAC reads end up confirming type -3 and we get AD 21,18.

So, why wouldn't we get two consensus sequences for type 0? That comes down to the CONS_CUTOFF used here. In this case the 2bp deletion accounts for 31% of the total, which is below 40% so it's not selected. Sure enough, lowering that cutoff to 30 fixes this problem. It creates problems elsewhere though (although it's marginally better at 30 overall). I guess there's a reason why I chose 40 for that number; it was almost certainly experimentally derived.

I do wonder though if it's the wrong ratio. It's currently 40% of the total, whereas it ought to be 40% of the type 0 data, and perhaps only of the type 0 data that spans the site (ignoring reads that don't extend to either side of the homopolymer).
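The difference between the two denominators can be shown with a quick sketch. The counts here are hypothetical, chosen only to reproduce the ~31%-of-total figure from the example:

```python
CONS_CUTOFF = 40  # percent threshold for selecting a second consensus

def second_consensus_selected(n_allele, denom, cutoff=CONS_CUTOFF):
    # True if the allele's share of `denom` reaches the cutoff.
    return 100.0 * n_allele / denom >= cutoff

# Hypothetical counts: 18 2bp-del reads out of 58 total (~31%), of which
# 36 reads are type 0 and span the homopolymer.
n_del2, n_total, n_type0_spanning = 18, 58, 36

second_consensus_selected(n_del2, n_total)           # False: 31% of all reads
second_consensus_selected(n_del2, n_type0_spanning)  # True: 50% of spanning type-0 reads
```

With the total-depth denominator the second type-0 consensus is never built, so the 2bp-del reads have nowhere better to go than the type -3 allele; with the spanning type-0 denominator the same counts clear the 40% cutoff.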

Anyway, I left CONS_CUTOFF as is (it's used for SNPs too) and added CONS_CUTOFF_DEL for this specific issue, then modified both the DEL and INC cutoffs from 40% to 30% and 20% for evaluation.

At 40%

SNP          Q>0 /   Q>=50 / Filtered
SNP   TP   71077 /   70956 /   70956
SNP   FP     766 /     296 /     293
SNP   GT      41 /      35 /      35
SNP   FN     310 /     431 /     431

InDel TP   11780 /   11709 /   11709
InDel FP     122 /      75 /      75
InDel GT      60 /      59 /      59
InDel FN     158 /     229 /     229

At 30%

SNP          Q>0 /   Q>=50 / Filtered
SNP   TP   71077 /   70956 /   70956
SNP   FP     766 /     296 /     293
SNP   GT      41 /      35 /      35
SNP   FN     310 /     431 /     431

InDel TP   11780 /   11709 /   11709
InDel FP     123 /      75 /      75
InDel GT      61 /      60 /      60
InDel FN     158 /     229 /     229

At 20%

InDel TP   11775 /   11705 /   11705
InDel FP     114 /      75 /      75
InDel GT      70 /      69 /      69
InDel FN     163 /     233 /     233

So by 20% you can see it's reducing false positives but increasing false negatives and genotype assignment errors. At 30% vs 40% it's basically neck and neck: we fix some cases and make others worse, so it's break even. Fixing this specific observation isn't necessarily the right fix, then. I'll explore different ratios further.