Closed jkbonfield closed 6 months ago
> Oddly, despite that, HG002 is much more accurate than HG003 and HG004, whose GT assignment error rates are an order of magnitude higher. This PR makes those a bit higher still (maybe another 20%). I cannot explain either observation, but perhaps it is simply down to the accuracy of the truth sets, as HG002 is by far the most widely curated of the three. Either that, or my analysis has a flaw somewhere.
It would help if I compared HG003 against the HG003 truth set (and similarly HG004 against the HG004 truth set). Having done that, this PR is an improvement as expected. Phew. :)
Percentages at Q>=1 / Q>=99 for 60x Illumina (HG002, called as part of HG00[234]):
=== 1.19 ===
InDel TP 98.89 / 98.74
InDel FP 2.82 / 2.50
InDel GT 2.35 / 2.35
InDel FN 1.11 / 1.26
=== indels-2.0 ===
InDel TP 98.63 / 98.54
InDel FP 1.44 / 1.31
InDel GT 2.22 / 2.20
InDel FN 1.37 / 1.46
=== devel ===
InDel TP 98.89 / 98.73
InDel FP 1.44 / 1.11
InDel GT 2.35 / 2.35
InDel FN 1.11 / 1.27
=== PR ===
InDel TP 98.87 / 98.70
InDel FP 1.25 / 0.96
InDel GT 0.51 / 0.50
InDel FN 1.13 / 1.30
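To put the devel and PR rows side by side, here is a minimal sketch that computes the relative change in each metric. It uses only the Q>=1 values copied from the table above; the `rates` dictionary and the `relative_change` helper are illustrative names, not part of bcftools or of the evaluation scripts used for this PR.

```python
# Q>=1 percentages for HG002, copied from the "devel" and "PR" rows above.
rates = {
    "devel": {"TP": 98.89, "FP": 1.44, "GT": 2.35, "FN": 1.11},
    "PR": {"TP": 98.87, "FP": 1.25, "GT": 0.51, "FN": 1.13},
}


def relative_change(before, after):
    """Return the change from `before` to `after` as a percentage of `before`."""
    return 100.0 * (after - before) / before


# Relative change per metric when moving from devel to this PR.
changes = {
    metric: relative_change(rates["devel"][metric], rates["PR"][metric])
    for metric in rates["devel"]
}

for metric, pct in changes.items():
    print(f"InDel {metric}: {pct:+.1f}%")
```

On these figures the GT assignment error rate falls by roughly 78% relative to devel and FP by roughly 13%, while TP and FN barely move.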
This applies where there are zero observed indels in a sample (but indels are observed in other samples). It's odd how rarely this appears to make a difference: we'd expect a significant change to FP, but it's tiny. Genotype assignment error changes a lot, but only in one of the 3 samples I tested.
On HG002, I see the following differences in calling rates.
This was HG002 called in conjunction with HG003 and HG004, but not as a trio (so no pedigree was supplied). Oddly, despite that, HG002 is much more accurate than HG003 and HG004, whose GT assignment error rates are an order of magnitude higher. This PR makes those a bit higher still (maybe another 20%). I cannot explain either observation, but perhaps it is simply down to the accuracy of the truth sets, as HG002 is by far the most widely curated of the three. Either that, or my analysis has a flaw somewhere.
Fixes #2130.