samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

Don't change indel qual when indelQ == 0. #2121

Closed jkbonfield closed 3 months ago

jkbonfield commented 4 months ago

For an alignment that has no indel itself but is aligned against reads that do, the indel quality is taken from the BAM quality. However, indelQ has already been assigned by that point, so this change leaves it alone when indelQ is zero, which is a special case marking a read that aligns to multiple indel "types" (lengths) with equal score.

This avoids excess AD numbers for poorly chosen alignments.
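The rule described above can be sketched as a small decision function. This is a simplified model of the behaviour stated in this description, not the BCFtools source: the function name and its parameters (`indel_q`, `bam_qual`, `read_has_indel`) are hypothetical, and treating the BAM quality as a straight replacement is an assumption.

```python
def effective_indel_qual(indel_q: int, bam_qual: int, read_has_indel: bool) -> int:
    """Hypothetical model of the indel quality choice described in this PR."""
    if read_has_indel:
        # The read carries the indel itself: keep the assigned indelQ.
        return indel_q
    if indel_q == 0:
        # Special case: the read aligned equally well to several indel
        # lengths, so the zero must not be overwritten by the BAM quality.
        return 0
    # Read without an indel, aligned against reads that have one:
    # the quality comes from the BAM quality (assumed to be a replacement).
    return bam_qual


print(effective_indel_qual(0, 30, False))   # ambiguous read stays at 0
print(effective_indel_qual(25, 30, False))  # unambiguous read takes the BAM quality
```

Under this model, an ambiguous read keeps a quality of zero and so, presumably, no longer contributes to the AD counts for a poorly chosen alignment.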

Fixes #2113

Benchmarks before and after on a single sample, HG002. The results are identical for both because the change only affects multi-sample calling: it alters scores when another sample has an indel but this one does not.

SNP          Q>0 /   Q>=50 / Filtered
SNP   TP   71077 /   70956 /   70956
SNP   FP     766 /     296 /     293
SNP   GT      41 /      35 /      35
SNP   FN     310 /     431 /     431

InDel TP   11780 /   11709 /   11709
InDel FP     122 /      75 /      75
InDel GT      60 /      59 /      59
InDel FN     158 /     229 /     229

The same HG002 sample, but called in the context of HG003 and HG004 and then split apart again.

develop:

SNP          Q>0 /   Q>=50 / Filtered
SNP   TP   71125 /   71074 /    4899
SNP   FP    1215 /     663 /     216
SNP   GT      56 /      45 /      30
SNP   FN     262 /     313 /   66488

InDel TP   11805 /   11799 /    1195
InDel FP     342 /     323 /      51
InDel GT     278 /     277 /      63
InDel FN     133 /     139 /   10743

This PR:

SNP          Q>0 /   Q>=50 / Filtered
SNP   TP   71125 /   71074 /    4899
SNP   FP    1215 /     663 /     216
SNP   GT      56 /      45 /      30
SNP   FN     262 /     313 /   66488

InDel TP   11805 /   11799 /    1195
InDel FP     171 /     148 /      37
InDel GT     278 /     277 /      63
InDel FN     133 /     139 /   10743

No change to SNPs, as expected, and the InDel FP count is approximately halved (342 to 171 at Q>0). This likely corresponds to the change in AD calculations, which previously gave false counts (for apparently no gain in sensitivity).
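As a quick check of the halving, the InDel figures from the multi-sample tables can be turned into an FP ratio and a precision per column. This is only a sketch: precision is taken here as TP / (TP + FP), which ignores the GT row and is an assumption, not a metric reported above.

```python
# InDel counts copied from the multi-sample tables (Q>0, Q>=50, Filtered).
labels = ("Q>0", "Q>=50", "Filtered")
indel_tp = (11805, 11799, 1195)   # identical on develop and in this PR
fp_develop = (342, 323, 51)
fp_pr = (171, 148, 37)


def precision(n_tp: int, n_fp: int) -> float:
    # Assumed definition: TP / (TP + FP); GT errors are not counted.
    return n_tp / (n_tp + n_fp)


fp_ratio = [after / before for before, after in zip(fp_develop, fp_pr)]
prec_develop = [precision(t, f) for t, f in zip(indel_tp, fp_develop)]
prec_pr = [precision(t, f) for t, f in zip(indel_tp, fp_pr)]

for label, r, p0, p1 in zip(labels, fp_ratio, prec_develop, prec_pr):
    print(f"{label:9s} FP ratio {r:.3f}  precision {p0:.4f} -> {p1:.4f}")
```

The Q>0 column gives an FP ratio of exactly 0.5; the other two columns also improve, by different amounts.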

Note that in both cases we're still better off not doing multi-sample calling if we want accuracy, which was a surprise.

pd3 commented 3 months ago

Thank you