Fix REF indel calls for multi-sample calling.

jkbonfield commented 6 months ago

This applies where a sample has zero observed indels at a site but other samples do have them. Oddly, this appears to affect very few calls: we'd expect a significant change to FP, but it's tiny. Genotype assignment error changes a lot, but only in one of the 3 samples I tested.

On HG002, I see the following differences in call counts.

Previous     All     QUAL>=100
InDel TP   11823 /   11798
InDel FP    5374 /    4898
InDel GT     296 /     291
InDel FN     115 /     140

New          All     QUAL>=100
InDel TP   11822 /   11795
InDel FP    5313 /    4805
InDel GT      80 /      74
InDel FN     116 /     143
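
For readers comparing the two tables, the per-category change can be worked out with a few lines of Python. This is only an illustrative sketch, not part of the PR: the counts are copied from the tables above, and the names `previous`, `new`, `change` and `summarise` are made up for the example.

```python
# Counts copied from the two tables above: (All, QUAL>=100) per category.
previous = {"TP": (11823, 11798), "FP": (5374, 4898), "GT": (296, 291), "FN": (115, 140)}
new = {"TP": (11822, 11795), "FP": (5313, 4805), "GT": (80, 74), "FN": (116, 143)}


def change(before, after):
    """Return (absolute delta, percent change relative to `before`)."""
    delta = after - before
    return delta, 100.0 * delta / before


def summarise(prev_counts, new_counts):
    """Map category -> [(delta, pct) for All, (delta, pct) for QUAL>=100]."""
    return {
        cat: [change(b, a) for b, a in zip(prev_counts[cat], new_counts[cat])]
        for cat in prev_counts
    }


summary = summarise(previous, new)
for cat, cols in summary.items():
    cells = "   ".join(f"{d:+5d} ({p:+6.1f}%)" for d, p in cols)
    print(f"InDel {cat}  {cells}")
```

On these numbers FP falls by roughly 1-2%, GT errors fall by roughly three quarters, and TP and FN move by only a handful of calls.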

This was HG002 called in conjunction with HG003 and HG004, but not as a trio (so no pedigree supplied). Oddly, despite that, HG002 is much more accurate than HG003 and HG004, which have GT assignment error rates an order of magnitude higher. This PR makes them a bit higher still (maybe another 20%). I cannot explain either of these, but perhaps it's simply down to the accuracy of the truth set, as HG002 is by far the most widely curated of the three. Either that or my analysis has a flaw somewhere.

Fixes #2130.

jkbonfield commented 6 months ago

> Oddly, despite that, HG002 is much more accurate than HG003 and HG004, which have GT assignment error rates an order of magnitude higher. This PR makes them a bit higher still (maybe another 20%). I cannot explain either of these, but perhaps it's simply down to the accuracy of the truth set, as HG002 is by far the most widely curated of the three. Either that or my analysis has a flaw somewhere

It would help if I compared HG003 against the HG003 truth set (and similarly for HG004). After that this is an improvement as expected. Phew. :)

jkbonfield commented 6 months ago

In percents: Q>=1 / Q>=99 for 60x Illumina (HG002, as part of HG00[234])

=== 1.19 ===                                                                    
InDel TP   98.89 /   98.74                                                      
InDel FP    2.82 /    2.50                                                      
InDel GT    2.35 /    2.35                                                      
InDel FN    1.11 /    1.26                                                      

=== indels-2.0 ===                                                              
InDel TP   98.63 /   98.54                                                      
InDel FP    1.44 /    1.31                                                      
InDel GT    2.22 /    2.20                                                      
InDel FN    1.37 /    1.46                                                      

=== devel ===                                                                   
InDel TP   98.89 /   98.73                                                      
InDel FP    1.44 /    1.11                                                      
InDel GT    2.35 /    2.35                                                      
InDel FN    1.11 /    1.27                                                      

=== PR ===                                                                      
InDel TP   98.87 /   98.70                                                      
InDel FP    1.25 /    0.96                                                      
InDel GT    0.51 /    0.50                                                      
InDel FN    1.13 /    1.30

samtools / bcftools

Fix REF indel calls for multi-sample calling. #2132