samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

mpileup --indels-cns and varying indel sizes #2117

Closed jkbonfield closed 4 months ago

jkbonfield commented 4 months ago

This is a fundamental problem for all calling modes, I believe, but it's particularly bad with --indels-cns because that mode uses edlib to compute an edit distance, which has no affine gap cost and charges a uniform +1 per base for mismatch, insertion and deletion.

Consider this set of alignments.

[Image: pileup screenshot showing REF (AAC), 2bp-deletion (A**) and 3bp-deletion (***) reads]

We have REF (AAC), 2bp del (A**) and 3bp del (***). The 3bp del arrives first, and once it meets the specified minimum number of reads to trigger an evaluation, we check the gap starting at 3220060. This gives a pair of consensus sequences for type -3 and type 0. (I think perhaps it should have a type -2 in there too, but that gap starts downstream, so in this specific column it just has the two.)

We then align all the records against these two candidates, and all the A** reads align better to *** than to AAC, so we end up attributing them as low-quality confirmations of the 3bp deletion, even though only 2 reads actually have a 3bp deletion!

chr20:3220060 AAAC A 0 // 21,18 
chr20:3220061 AAC A 0 // 20,19 
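Under plain unit-cost edit distance the misattribution falls out directly. A minimal sketch (the TT flanking bases are hypothetical; only the AAC / A / deletion structure comes from the example above):

```python
def edit_dist(a, b, indel_cost=1, mismatch_cost=1):
    # Standard DP edit distance with configurable per-base costs.
    # With both costs at 1 this mimics edlib's uniform scoring.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel_cost]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch_cost)
            cur.append(min(sub, prev[j] + indel_cost, cur[-1] + indel_cost))
        prev = cur
    return prev[-1]

read  = "TTATT"    # a 2bp-deletion read: AAC -> A, with hypothetical TT flanks
cons0 = "TTAACTT"  # type 0 consensus (REF allele, AAC)
cons3 = "TTTT"     # type -3 consensus (***)

# Uniform unit costs: the 2bp-del read is distance 1 from the 3bp deletion
# but distance 2 from REF, so it gets counted as weak support for the
# wrong allele.
print(edit_dist(read, cons0), edit_dist(read, cons3))  # -> 2 1
```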

If I reprocess the traceback produced by edlib, add another 10 per indel, and also cap the maximum edit distance, then aligning * to A and to AAC both end up with the same (capped) score and the alignment is rejected as not confirming either allele. That's a hacky way to solve this, but it works (with caveats about loss of evidence elsewhere):

chr20:3220060 AAAC A 0 // 17,6 
chr20:3220061 AAC A 0 // 17,13 

Still not perfect, but enough to reject it in call.
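The re-scoring hack can be sketched like so. The +10 indel penalty is the number quoted above; the cap value is illustrative, not the one actually used:

```python
INDEL_PENALTY = 10  # extra cost per indel base, per the hack described above
CAP = 10            # illustrative cap on the total edit distance

def rescore(n_mismatch, n_indel_bases, penalty=INDEL_PENALTY, cap=CAP):
    # Re-weight an edlib traceback: +1 per mismatch, +(1 + penalty) per
    # indel base, with the total capped.
    return min(n_mismatch + n_indel_bases * (1 + penalty), cap)

# The 2bp-deletion read against the two candidates:
vs_del3 = rescore(0, 1)  # one residual indel base vs the *** consensus
vs_ref  = rescore(0, 2)  # two indel bases vs the AAC consensus

# Both hit the cap, so the read confirms neither allele and is rejected.
assert vs_del3 == vs_ref == CAP
```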

Arguably we need the investigation at 3220060 to call 3 possible alleles: type -3, type -2 (1bp downstream) and type 0. I'm not sure if that's how indels-2.0 works (I'm guessing so, as it gave AD 38,2 for 3220060), or whether it's simply the BAQ scoring that knocked the confidences down and made the alignments score 0. I fear that simply rejecting alignments which contain any additional indel (which is essentially what my scoring hack did) would be hugely detrimental to PacBio, ONT, Ultima, etc.

I'm not sure how we'd compute types based on downstream indels within the same STR, but it's probably the way we need to go for robustness.

Edit: then again, if we had the third allele then we might get this, which would be confusing (and wrong, IMO):

chr20:3220060 AAAC AA
chr20:3220061 AAC  A

I've no idea if they'd be normalised away, but we shouldn't be calling something which isn't at that location. Really we need a third allele to soak up alignments, but not to be callable. So we evaluate against *, A* and AAC, allocate alignments to the 3 possibilities, and then emit scores only for the callable candidates.
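The soak-up idea can be sketched as follows. This is hypothetical, not bcftools code; and because the three candidates here are nested deletions of AAC, a simple length difference stands in for the real alignment distance:

```python
# Candidate alleles at this column; "A*" starts downstream, so it should
# absorb reads without ever being emitted as a call here.
candidates = {"*": "", "A*": "A", "AAC": "AAC"}
callable_at_pos = {"*", "AAC"}

def dist(read, allele):
    # Toy metric: for nested deletions of AAC, the length difference
    # equals the edit distance.
    return abs(len(read) - len(allele))

reads = ["AAC", "AAC", "A", "A", "", ""]  # toy pileup: 2x REF, 2x 2bp del, 2x 3bp del
ad = {k: 0 for k in candidates}
for r in reads:
    ad[min(candidates, key=lambda k: dist(r, candidates[k]))] += 1

# Only callable alleles reach the output; the A* reads no longer inflate
# the 3bp-deletion depth.
emitted = {k: v for k, v in ad.items() if k in callable_at_pos}
print(emitted)  # -> {'*': 2, 'AAC': 2}
```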

Did you do some nearby deletion detection or position normalisation step in your indels-2.0 Petr? How is it not tripping up on this case?

jkbonfield commented 4 months ago

So I see one way of avoiding this without preemptive filtering based on downstream indels within an STR (which is complex and liable to other mistakes).

We compute two consensus alleles for each "type": 2 for type -3 and 2 for type 0, assuming there's a difference. Type -3 only has one choice, as it occurs in just 2 reads and they match. Type 0 potentially has 2 alternatives, one with AAACAAAC and one with AAAC. If it picks both of these, then the AAAAC reads confirm type 0 and the type -3 becomes AD 34,2. If it doesn't, the AA**AAC reads end up confirming type -3 and we get AD 21,18.

So, why wouldn't we get two consensus sequences for type 0? That comes down to the CONS_CUTOFF used here. In this case the 2bp deletion accounts for 31% of the total, which is below 40% so it's not selected. Sure enough, lowering that cutoff to 30 fixes this problem. It creates problems elsewhere though (although it's marginally better at 30 overall). I guess there's a reason why I chose 40 for that number; it was almost certainly experimentally derived.

I do wonder though if it's the wrong ratio. It's currently 40% of the total, whereas it ought to be 40% of the type 0 data, and perhaps only of the type 0 data that spans the site (ignoring reads that don't extend to either side of the homopolymer).
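The difference between the two denominators can be shown with a quick sketch. The counts here are hypothetical, chosen only to reproduce the ~31%-of-total figure from the example:

```python
CONS_CUTOFF = 40  # percent threshold for selecting a second consensus

def second_consensus_selected(n_allele, denom, cutoff=CONS_CUTOFF):
    # True if the allele's share of `denom` reaches the cutoff.
    return 100.0 * n_allele / denom >= cutoff

# Hypothetical counts: 18 2bp-del reads out of 58 total (~31%), of which
# 36 reads are type 0 and span the homopolymer.
n_del2, n_total, n_type0_spanning = 18, 58, 36

second_consensus_selected(n_del2, n_total)           # False: 31% of all reads
second_consensus_selected(n_del2, n_type0_spanning)  # True: 50% of spanning type-0 reads
```

With the total-depth denominator the second type-0 consensus is never built, so the 2bp-del reads have nowhere better to go than the type -3 allele; with the spanning type-0 denominator the same counts clear the 40% cutoff.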

Anyway, I left CONS_CUTOFF as is (it's used for SNPs too) and added CONS_CUTOFF_DEL for this specific issue, then modified both the DEL and INC cutoffs from 40% to 30% and 20% for evaluation.

At 40%

SNP          Q>0 /   Q>=50 / Filtered
SNP   TP   71077 /   70956 /   70956
SNP   FP     766 /     296 /     293
SNP   GT      41 /      35 /      35
SNP   FN     310 /     431 /     431

InDel TP   11780 /   11709 /   11709
InDel FP     122 /      75 /      75
InDel GT      60 /      59 /      59
InDel FN     158 /     229 /     229

At 30%

SNP          Q>0 /   Q>=50 / Filtered
SNP   TP   71077 /   70956 /   70956
SNP   FP     766 /     296 /     293
SNP   GT      41 /      35 /      35
SNP   FN     310 /     431 /     431

InDel TP   11780 /   11709 /   11709
InDel FP     123 /      75 /      75
InDel GT      61 /      60 /      60
InDel FN     158 /     229 /     229

At 20%

InDel TP   11775 /   11705 /   11705
InDel FP     114 /      75 /      75
InDel GT      70 /      69 /      69
InDel FN     163 /     233 /     233

So by 20% you can see it's reducing false positives but increasing false negatives and genotype assignment errors. At 30% vs 40% it's basically neck and neck: we fix some cases and make others worse, so it's break even. Fixing this specific observation isn't necessarily the right fix, then. I'll explore different ratios further.