revision of 1679 plus switch to edlib over BAQ

samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html

http://samtools.github.io/bcftools/

Other

649 stars 240 forks source link

revision of 1679 plus switch to edlib over BAQ #2045

Closed jkbonfield closed 7 months ago

jkbonfield commented 9 months ago

Please give me confirmation that this is interesting and will be considered for inclusion

Despite demonstrating a very significant leap forward in calling accuracy, #1679 never received any comments or consideration. Instead it triggered the creation of the --indels-2.0 option instead, with many of the ideas being rewritten for that. Unfortunately --indels-2.0 just doesn't perform that well, as it wasn't completed and misses many of the viral heuristics that remove the false positive calls.

In personal communications, one reason for 1679 being passed over appears to be because it changes the behaviour of existing bcftools and doesn't offer a way back to the old calling algorithm for compatibility. (IMO if you want it to never get better, don't upgrade!)

I'm not a believe in only locking new functionality behind new options, as that means we never progress (no one is going to go through all the nextflow pipelines and amend them), but I'll concede the lack of compatibility may be an issue so this PR adds versioned profiles (eg -X illumina-1.18 that behave as the previous release, while updating the default profiles to be more accurate.

The main summary of changes so far are:

Use of a consensus instead of patching the reference for constructing the candidate indel alleles to align against. This is the most significant accuracy gain from #1679.
Drop the use of BAQ as an alignment algorithm and replace with edlib. BAQ was never really designed as an aligner, but a quality recalibration too. Using it to align sequences is both very heavy weight and also error prone as it won't always give a good alignment score on things that have STRs or other low-complexity. Edlib is vastly faster, and we compensate for the lack of quality assessment by a combination of heuristics such as STR scanning (this was previously added, but it's been beefed up a bit more) and some crude quality score evaluations such as minimum quality within an STR.
More profiles and revised profiles. ONT-super and Ultima have been added.
New parameters: --del-bias to control biases in insertion vs deletion error models in instruments (many single cell instruments are more likely to miss bases than insert bases); --poly-mqual to enable/disable the use of minimum quality values within homopolmers (harms clocked instruments such as Illumina, but benefits many others).

SNP calling is unchanged.

As requested, I moved the code from out of bam2bcf_indels.c and into its own file. I'm not happy with the --edlib parameter as I feel edlib vs BAQ isn't the really important thing here. It's a big speed up, but the real work is the better algorithm. That's (for now) in bam2bcf_edlib.c, following the same strategy @pd3 took with duplicating functions into bam2bcf_iaux.c. I'm open to better names on the command line. I'm tempted by --indels-3.0, but I fear that may cause friction. I'm not intending to say it's deprecating your indels-2.0 code (it's for the users to decide which algorithms they prefer), and indeed it's not a derivation of it as it preceeds it and the indels-2.0 code was your earlier rewrite of my own 1679 PR so really that was "indels-1.5" anyway. I don't really like the naming scheme for either of them, but am open to suggestions.

The recall vs precision is generally significantly ahead of both develop and the experimental --indels-2.0 mode (which also crashes on a lot of the long-read data sets I tried, such as ~jkb/lustre/GIAB/ont/PAO83395.pass.cram on our local systems). Speed wise, it's pretty much unchanged with Illumina, but on long read technologies with higher indel errors the increased number of candidate indels to asses means the new profiles and edlib use can be 2-3x faster.

This PR however isn't always quite matching the old 1679 one. It's close, but marginally behind. That's probably due to using heuristics instead of an HMM for incorporation of quality scores to the alignment process. It's close enough though and it is much faster on long read data.

I'm still running experiments and assessing results, but this is a proof of principle.

Illumina GIAB HG002 chr1 results. These all had BAQ semi-enabled (default mode) for SNPs.

Total INDEL on chr1 = 45604
2% FN = 912
5% FN = 2280

Illumina:       2% FN          5% FN
    FP,GT,FN    Q/FP,GT,FN
devel   294,1078,762    23/252,1068,910    100/175,966,2268 42m38s
#1679    341,188,488    75/236,170,910     135/189,151,2254 77m4s
--id2.0  517,864,700    62/411/833/911     154/334,651,2290 63m48s
edlib    307,223,649    53/234,210,913     113/181,177,2286     60m25s
ed10     384,225,592    79/252,205,908     134/213,182,2263     62m44s

The comma-separated values above are False Positives, GT assignment errors, and False Negatives. The val in val/FP,GT,FN is the QUAL threshold applied as a filter, dropping all variants with lower quality. I've been manually adjusting that until I get to a similar FN rate so I can do a more reasonable side by side assessment of "given the same recall, how does the accuracy compare". I'll produce some plots later on once I've completed my trials.

"ed10" is --edlib --indel-bias 10 to push sensitivity up higher, especially at QUAL 0 (no filtering). It's still not as sensitive as 1679 though. 1679 was better here for genotype assignment accuracy, but FP vs FN was broadly similar to the new mode.

PacBio HiFi

                           ~1500FN              5% FN                                   
dev     25310,6014,1332     6/20532,6064,1491   23/ 9815,5986,2288      111m50s         
#1679   28891,5622, 778    26/10704,5608,1504   38/ 6466,5567,2329      47m             
edlib    9392,5406,1417     4/ 8624,5397,1496   13/ 5462,5342,2255      38m55s          
gatk   118211,5559, 997    90/29501,5541,1503  130/16044,5512,2294      ~944min          
--id2.0 crash

The new edlib mode is ahead of the 1679 PR, which itself was signifcantly ahead of develop (and gatk4 it turns out, but I didn't run all the myriad of post-processing tools, just QUAL filtering). The new code is MUCH faster too. --indels2.0 died:

bam2bcf_iaux.c:468: iaux_align_read: Assertion `qry_off1<=qry_off2' failed.

ONT "super" base call mode. (PAO83395.pass.cram from ONT)

dev     2306,2712,12507  251m22s                                                        
#1679   2403,2322,11918  733m42s                                                        
bias    2677,2132,12650  111m50s                                                        
--id2.0 crash

I still have more analysis to do here, along with the Ultima Genomics calling. Both the ONT and UG data set however are marginal on small indel detection due to the error models on the instruments. (If anyone knows of a published HG002 data set from ONT using "sup-dup" mode data then please feel free to point me at it. Similarly for any other data types out there.)

jkbonfield commented 9 months ago

Also, what do people want to see for evidence? I'm thinking of similar graphs to the ones I produced in the last plots, but I may adjust the counting.

My view is for diploid organisms we're making 2 calls at every site - one for each allele. For simplicity I want to switch with precision vs recall, so we can count REF/REF, REF/ALT and ALT/ALT as 2 calls each. If we call ALT/ALT and it's really ALT/ALT then that's two true variants. If we call it REF/REF then it's 2 false negatives, and if we call it REF/ALT then it's 1 true positive and 1 false negative. Similarly a true REF/REF called as ALT/ALT is 2 false positives, and as REF/ALT is 1 false positive.

I think that should be enough to give due importance to genotype assignment (which is one key improvement, especially on Illumina where it's a 6-fold reduction), while still keeping the graph simple. Comments?

pd3 commented 9 months ago

@jkbonfield As requested offline, here some information to the problem with the original draft we were discussing a while ago. I don't know if it is still an issue, but the observation back then was:

Each read was aligned to two consensus sequences generated from indels visible in the sample. In the case of one hom-ALT sample with an insertion and one hom-REF sample, the code triggered four realignments for the second sample, all to the same identical reference consensus, and the alternate consensus was not considered for the hom-REF sample at all. This lead to a low quality estimate.

The program should avoid recomputing the same read for the same consensus twice, just to save time. More importantly, if there aren't enough indel types within the sample, the most popular consensus from other samples should be used.

jkbonfield commented 9 months ago

On Mon, Dec 11, 2023 at 05:52:08AM -0800, Petr Danecek wrote:

@jkbonfield As requested offline, here some information to the problem with the original draft we were discussing a while ago. I don't know if it is still an issue, but the observation back then was:

Each read was aligned to two consensus sequences generated from indels visible in the sample. In the case of one hom-ALT sample with an insertion and one hom-REF sample, the code triggered four realignments for the second sample, all to the same identical reference consensus, and the alternate consensus was not considered for the hom-REF sample at all. This lead to a low quality estimate.

The program should avoid recomputing the same read for the same consensus twice, just to save time. More importantly, if there aren't enough indel types within the sample, the most popular consensus from other samples should be used.

Do you have any example data demonstrating this?

James

-- James Bonfield @.***) The Sanger Institute, Hinxton, Cambs, CB10 1SA

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.

jkbonfield commented 9 months ago

I'm adding some analysis results of various GIAB HG002 samples. These show false positive rate (FPr) vs false negative rate (FNr), and also genotype assignment error rate (GTr) vs FNr.

FPr is defined as the fraction of called variants that are false. GTr is the fraction of true variants that were called with the incorrect genotype. FNr is the fraction of true variants that were not called.

"Truth" is a little bit flexible as there is an inherent error in measuring the same consequence via different calls. Eg 1 indel + 1 SNP vs 2 indel, etc.

Note: I'm reporting indels only here as this PR doesn't change the SNP calling.

Low-depth Illumina

High-depth Illumina

It's clear that the genotype assignment accuracy is the huge win here, especially when it comes to deeper data. This was systematically a flawed calculation in the old code due to the way it scores multi-allelic sites, but also aligning to a consensus instead of patched reference is clearly a significant win for genotype analysis too.

Shallow depth, there isn't a marked gain over --indels-2.0. It's a bit less sensitive at the non-filter level, and a bit better FP/FN ratio at the higher QUAL thresholds. For 60x data I also showed the old #1679 results. We're similar on genotype call accuracy, but FPs have been further reduced since then. This PR is substantially head of both devel and indels-2.0 on the deeper data. Sadly indels-2.0 is a bad option there, as it's even worse than develop.

Timing wise, develop, this PR ("new") and indels-2.0 were all in the ~66m for chr1 60x and ~23m for 15x. Not much difference.

I'll do other technologies in their own posts.

(And yes, I'm being lazy! Much less work to just cut and paste a full window than to save the image to disk, find it in firefox, and then reupload it again. "control-V" is less work :-)).

jkbonfield commented 8 months ago

Next up is BGISeq, with the new -X BGI profile and equivalent options explicitly stated in develop branch. I also included the published GATK calls uploaded to the GIAB site.

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_BGIseq_2x150bp_100x/ https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_BGIseq_2x150bp_100x_07192023/

Deep 100x data:

On the full depth data, this PR is more accurate than the GATK results uploaded with the published dataset. Unfortunately the previous updates to --indels-2.0 are the worst performing due to a high false positive rate.

Shallow 10x data:

It's less clear with shallow 10x BGI data. Develop, indels-2.0, and this PR all have positions in the FN vs FP tradeoff where they're best, but arguably develop is best where it matters most (high recall) although frankly none are doing well. This PR however does still have significantly better genotype accuracy than the others.

Speed wise, the full data set was broadly the same speed for all bcftools runs, at approx 2hours.

jkbonfield commented 8 months ago

Next up, a modern 53x PacBio HiFi alignment from 20kb and 15kb libraries.

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/GRCh38/

It comes with DeepVariant analysis too https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/PacBio_CCS_15kb_20kb_chemistry2_10312019/GRCh38/

53x data:

What's staggering here is clearly how far ahead DeepVariant is. It's obviously got far better understanding of the errors involved in this data. I did note there are clusters of errors that we make around supplementary reads, and that filtering those out (or alternatively filtering out anything with long soft-clips) removes almost all of those errors, but overall it's not a big impact as the number of affected regions is small. We still have a very significant number of false positives to remove though.

It may be worth investigating some of the tricks used in samtools consensus here too. I found these to be significantly beneficial for HiFi:

Recalibration of quality values, so the QUAL score actually ties in with the error likelihood. Note it needs recalibrating for SNP, INS and DEL separately too, or rather recalibrate once and have additional error type priors.
The notion of a localised mapping quality. MAPQ is almost useless on long-read data. What's more helpful is the number of differences within a local window, as it can help spot flights of fantasy and potential diversion caused by duplications.

Unfortunately I don't have --indels-2.0 results for PacBio as it hits an assertion failure very rapidly.

Speed wise, develop took 140 min and this PR 60 min, so is 2.3x faster.

jkbonfield commented 8 months ago

An older 2019 GIAB HG002 PacBio HiFi data set. This appeared to have different characteristics, and got worse with higher --indel-bias rather than better as the more modern one did. Both are assessed at the same value of 1.2 (default for CCS profile), as a middling setting.

More for comedy, I also including processing of samtools consensus -f pileup output. This doesn't officially support VCF as it's not a variant caller, but it's possible to turn the consensus into variants with a bit of scripting. It's not very sensitive as it has no realignment step whatsoever, focusing primarily of reporting the consensus of the data as-given. However high FN aside, it's at a reasonable place in the FN/FP tradeoff and it's actually the most accurate for genotype assignments! This possibly comes from the improved quality recalibration and local mapping quality assessments mentioned above, so there is room to borrow more ideas.

We're comfortably ahead of GATK on this data set, although that was my own run of GATK and I can't be sure I used it correctly as it's a complex beast. This PR is also still an improvement on the already-improving PR1679.

Speed wise, devel took 120 min, this PR 41 min (so almost 3x faster), and sadly --indels-2.0 crashed again. I don't have recordings of how long GATK took, but the date put in the header and the file time stamp implies about 250 min, which seems reasonable as I recall it being slower than current bcftools.

I think this places us in a good spot for calling performance. We're likely to never be matching DeepVariant, but we'd be a fraction of the CPU cost. (I haven't ran it myself, but being AI driven I can't believe it's any faster than GATK)

jkbonfield commented 8 months ago

Ultima Genomics.

This was a control-sample spiked in with other data, and is ~20x HG002. I have deeper data available for others (eg HG004), but not yet analysed. Ultimata Genomics main error model is indels, particularly homopolymers IIRC, so it performs quite poorly. However this PR is still ahead of develop.

For the develop run I copied the parameters from the new -X ultima profile (bar the new params added in this PR itself).

Not much more to say on this. Probably a work in progress still, so we can learn more on handling the error models for this data.

Speed was develop 34min, this PR 31min. So broadly comparable.

jkbonfield commented 8 months ago

ONT "sup" base caller, HG002 sample again

Disappointingly we're behind the old PR now, although again all of them are doing a really bad job. Even with the super mode basecaller the error rate is just too high for us to realiably call indels. Likely a better algorithm that's more in tune with the error profile could tease out a better result. I'll play more with the parameters to see if it's simply a matter of tuning things a bit. I'd like it to at least match the old PR here.

As with other long-read technologies this PR is significantly ahead on speed. 104min for this vs 317min for develop. So 3x faster again. As with other long-read technologies, --indels-2.0 crashes again:

bcftools: bam2bcf_iaux.c:468: iaux_align_read: Assertion `qry_off1<=qry_off2' failed.

jkbonfield commented 8 months ago

@jkbonfield As requested offline, here some information to the problem with the original draft we were discussing a while ago. I don't know if it is still an issue, but the observation back then was:

Each read was aligned to two consensus sequences generated from indels visible in the sample. In the case of one hom-ALT sample with an insertion and one hom-REF sample, the code triggered four realignments for the second sample, all to the same identical reference consensus, and the alternate consensus was not considered for the hom-REF sample at all. This lead to a low quality estimate.

The program should avoid recomputing the same read for the same consensus twice, just to save time. More importantly, if there aren't enough indel types within the sample, the most popular consensus from other samples should be used.

Can you please elaborate. I haven't looked at what it's doing under the hood yet, but it does get the right answer. All the versions I've tried have. My test:

@ seq4d[samtools.../bcftools]; cat _dat.pl
#!/usr/bin/perl

# GRCh38
#>chr1:100000-100019
#CACTAAGCACACAGAGAATAA

print "\@SQ\tSN:chr1\tLN:248956422\n";
print "\@RG\tID:s1\tSM:s1\n";
print "\@RG\tID:s2\tSM:s2\n";

my $DP=6/2;

# Sample 1, HOM INS +TT
for (my $i=0;$i<$DP;$i++) {
    print "s1a.$i\t0\tchr1\t100000\t60\t10M2I10M\t*\t0\t0\tCACTAAGCACACACAGAGAATA\tIIIIIIIIIIIIIIIIIIIIII\tRG:Z:s1\n";
    print "s1b.$i\t0\tchr1\t100000\t60\t10M2I10M\t*\t0\t0\tCACTAAGCACACACAGAGAATA\tIIIIIIIIIIIIIIIIIIIIII\tRG:Z:s1\n";
}

# Sample 2, HET INS +TT
for (my $i=0;$i<$DP;$i++) {
    print "s2a.$i\t0\tchr1\t100000\t60\t10M2I10M\t*\t0\t0\tCACTAAGCACACACAGAGAATA\tIIIIIIIIIIIIIIIIIIIIII\tRG:Z:s2\n";
    print "s2b.$i\t0\tchr1\t100000\t60\t20M\t*\t0\t0\tCACTAAGCACACAGAGAATA\tIIIIIIIIIIIIIIIIIIII\tRG:Z:s2\n";
}

@ seq4d[samtools.../bcftools]; ./_dat.pl |./bcftools mpileup --indels-cns -a AD --fasta-ref $HREF38 - 2>/dev/null | bcftools call -vm - 2>/dev/null | egrep -v '^#'
chr1    100009  .   CACA    CACACA  16.4975 .   INDEL;IMF=0.75;IDV=9;DP=12;VDB=5.52906e-06;SGB=-1.03866;RPBZ=-3.31662;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=3;AN=4;DP4=3,0,9,0;MQ=60   GT:PL:AD    1/1:39,18,0:0,6 0/1:12,0,127:3,3

So it's produced two samples, one with +TT on both alleles and one with +TT on one allele only. Tview shows me this:

 100001               
CACTAAGCAC**ACAGAGAATA
..........  ..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........**..........
..........AC..........
..........**..........
..........AC..........
..........**..........

They key thing here is the FORMAT fields:

GT:PL:AD 1/1:39,18,0:0,6 0/1:12,0,127:3,3

GT of 1/1 and 0/1, and AD of 0,6 and 3,3. That's spot on.

What am I missing about your scenario?

Also, is this still how people run multi-sample variant calling? I thought the world had moved on to gVCF. Do people do multi-sample analysis for small things like trios or tumor/normal analysis, and gVCF only for population studies? (I'm out of my depth here as I've only looked at simple-sample calling, partly due to the lack of decent high quality truth sets.)

jkbonfield commented 8 months ago

Ah, there is still a problem if sample 1 is HOM +CA and sample 2 is HET +TT. We get CA called as both HOM and HET and TT doesn't turn up anywhere, unless I split into two single-sample files, where it works fine.

 100001               
CACTAAGCAC**ACAGAGAATA
..........  ..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........AC..........
..........TT..........
..........**..........
..........TT..........
..........**..........
..........TT..........
..........**..........
..........TT..........
..........**..........
..........TT..........
..........**..........

This is true for develop, indels-2.0, indels-cns (this PR), and #1679. I think it's a fundamental flaw of the shared code that does indel type selection.

Develop calls the second sample as 0/0 with AD 5,0. (Correct AD given the alleles, but wrong call). indels-2.0 calls it 0/0 with AD 10,0. (Incorrect AD and call). indels-cns calls is 0/1 with AD 5,5. (Correct call if it was a single-sample file, but TT is missing from ALT; should be 1/2 and 5,5)

If I had +CA and +TTT then it works, as we have type 2 and type 3, but +CA and +TT leads to one type only, and so everything gets called as 2bp insertion. Really it needs to check the insertion consensus for each sample matches as that's what it actually produced when aligning. Ie CA,TT.

A useful spot, but again this isn't specific to this PR. It's just a weakness of the existing algorithm.

jkbonfield commented 8 months ago

Ah sorry I misread it. HOM ALT and HOM REF... That does give an issue with this new code which I need to investigate, as the QUAL field is very low for some unknown reason. However the calls are still correct. The original PR in #1679 works fine however.

Edit: blasted BAQ again! Disabled it and sure enough it works again. I've disabled it in most of the profiles, but I wasn't using one in my command-line tests and we can't change the default behaviour :(

So I still don't understand the failure case you observed. Data to demonstrate it would be useful please.

jkbonfield commented 8 months ago

Revised ONT parameters. Switched from -h80 to -h110 to remove false negatives (but at the expense of FP), and then --indel-bias 0.8 to try and remove those FPs (at the expense of FN). The combination appears to generally be a better FP/FN tradeoff, and is more sensitive to boot so at QUAL 0 it's significantly better recall.

My graph here is going all the way up to 50% FN, which is very high, so the lower part is the more important bit for general usage.

These changes do unfortunately take a hit on genotype accuracy, but it's probably more important to identify the indels first and genotype accuracy while important is the next thing to assess. Overall GT accuracy is still quite high for all methods.

jkbonfield commented 7 months ago

Attempts to fix some failing tests lead me to improve the way AD was calculated, although it still doesn't fix some of the failure cases. Overall it's a win though. I no longer reject alignments by setting the base type to 5 (unknown), instead preferring to adjust the quality of alignment. This gives higher, more realistic, AD figures now. I think this addresses some of the concerns you had over the old PR? However it does so at a cost of slightly higher misdiagnosis of the real genotype.

Current WIP figures. "10h" was my internal name for the what was listed as "new" above, and "12a" is my latest revision code name.

Illumina:

Better sensitivity at QUAL 0 and better FN vs FP tradeoff at most filter levels, but slightly less accurate on genotype assignment.

Similar deal for deep sequencing. GT is poorer, but both are miles ahead of dev or indels-2.0. FP vs FN wise, it varies on the filtering level. Low QUAL this previous PR was a bit better (albeit less sensitive), while higher QUAL this latest commit is more discriminating. Broadly comparable.

BGI

Considerably more sensitive than this earlier PR, although "dev" still beats it on lower QUAL filtering. As with Illumina, we're no longer so great on the GT assignment accuracy, with another 1-2% additional false assignment.

At deeper data, the new commits really help on BGI, both for FN vs FP but also vs GT error. Both are lightyears ahead of the old dev branch.

See earlier comments for comparison vs the GATK results that were published along side the GIAB BGI data set, and --indels-2.0 results.

More data sets incoming...

jkbonfield commented 7 months ago

PacBio HiFi (new HG002, full data at 53x).

Big win for both FP/FN and GT/FN plots.

Still a big win for shallow data, although sensitivity is a little down on before. It's still better than develop though.

jkbonfield commented 7 months ago

Ultima Genomics (as above, it's not that deep as it was just a control sample).

New commit is slightly detrimental, but still ahead of develop.

jkbonfield commented 7 months ago

ONT "super"; approx 50x coverage

There is a cross-over where both the earlier version of this PR and dev give a better FN/FP tradeoff, but this is at nearly 35% lost calls. The new commits have significantly improved recall rates which probably offer a better tradeoff.

However as many cases before, we see this comes at a cost of less accurate genotype assignments.

For sub-sampled 15x coverage:

Similar to 50x, but closer in FN/FP ratio. The new commits show higher sensitivity still, and both are considerably ahead of develop.

jkbonfield commented 7 months ago

Following requests outside of PR review here (but please do comment here rather than use out-of-band communication, for sake of transparency and to keep any other interested parties informed), I did some analysis of the various heuristics which had accrued over time, and culled some again as while they originally helped they're no longer so relevant following changes elsewhere. Hence the code is a little simpler now.

Of the ones left, I have some examples, both pro and con, of the sort of indel calls they alter, along with summaries of the total numbers of changes. See the TEST- comments in the source code for the various code blocks ifdefed in or out. (TEST 1, 3 and 6 are still in place.)

Tests below were on a region of HG002 chr1. My main tuning and training set is done on chr1:10M-20M.

TEST 1

30x Illumina

    With           :    Without
    FP    GT    FN :    FP    GT    FN
    22    31    51 :    23    47    61
    18    31   102 :    16    47   100
    12    30   153 :    14    46   150
    12    29   201 :    10    46   201
    11    29   251 :    10    43   254
    11    28   304 :     8    43   304
    11    28   353 :     8    43   359
     9    28   407 :     7    43   400

It significantly benefits genotype accuracy and improves sensitivity.

At low QUAL filtering it's a good FN/FP tradeoff. At higher filtering, the test adds to FP rate, indicating we're causing high quality FPs due to a small amount of ref bias.

False negatives: Example of missing (without H) high qual variant where one allele is lost (0/1 or 1/1 vs 1/2):

HG002 chr1:12606136 chr1:14048395 chr1:14465398 chr1:18282635

Example where no variant is called (without H)
HG002 chr1:13602026 (Q40)
HG002 chr1:11987311 13456803 (low qual)

15x Illumina

    With           :    Without
    FP    GT    FN :    FP    GT    FN
    43    67   148 :
    39    67   152 :    36   103   157
    16    63   203 :    16    99   204
    14    59   252 :    12    93   259
    12    57   307 :    10    91   302
    10    56   350 :     8    89   358
     8    51   402 :     7    86   404

Again GT error is reduced with this heuristic and a slight improvement to sensitivity. Again we see increased FP for fixed FN especially once filtering to higher QUAL levels.

53x PacBio HiFi

With            :    Without
 FP    GT    FN :    FP    GT    FN
216    74    48 :
175    74    52 :   225    80    50
 48    73   102 :    51    79   106
 39    70   160 :    43    76   162
 34    70   202 :    37    76   203
 31    65   256 :    32    71   258
 31    62   305 :    32    68   306
 30    58   360 :    29    64   362
 29    50   416 :    28    57   418

Not a huge difference, but a few more false positives called for low QUAL filtering and, again, a bit less sensitive at Q0.

False positives CAUSED by the heuristic:

    chr1    13602043        AT      A       81.0313 0/1
    chr1    15522545        C       CA      3.05998 0/1
    chr1    16368451        G       GT      42.2321 1/1

False postives CURED by the heuristic:

    chr1    10171892        A       ATAAG   8.51064 0/1
    chr1    10595815        C       CA      36.774  1/1
    chr1    11310767        A       AC      49.9678 1/1
    chr1    11937922        A       AT      52.4146 1/1
    chr1    14928448        C       CA      4.68159 0/1
    chr1    15203429        A       AT      11.5798 0/1
    chr1    15704701        G       GC      58.2334 1/1
    chr1    15738295        A       AG      61.2629 1/1
    chr1    17557345        A       AC      4.14406 0/1
    chr1    18999113        C       CA      7.74464 0/1
    chr1    18999124        T       TG      8.27893 0/1
    chr1    19664903        G       GC      52.2149 1/1

Looking at high quality ones: CURED chr1:15738295 1/1 Baffling this is labelled as homozygous It's a poly-G with 6 reads having +1G and 46 not. The +1G is false. (There are phased variants before and after this position and totally unphased with the poly-G changes)

CAUSED chr1:13602043 0/1 (called AT A by us and dev, ATGC A by indels-2.0) Tricky large insertion at 13602026 which can be confused as MNPs or multiple indels instead. Change is real and phased both up and down stream. (NB: chr1:13599373 is a neighbouring challenging phased poly-T difference. Real, but tricky to guage exact run-lengths)

We call the main insertion, but this is false. May need to track regions and call multiple indels in one go?

15x PacBio

With            :    Without
 FP    GT    FN :    FP    GT    FN
554   140   110 :   557   141   112
306   136   159 :   306   139   160
212   135   200 :   209   138   202
128   132   250 :   124   136   252
 99   127   308 :    96   130   310
 84   126   351 :    82   129   353
 75   121   400 :    72   125   401

Marginal difference on low coverage PB data too

Conclusion: not huge, but adds a little to sensitivity and it's a considerable reduction in genotype assignment errors, especially on low depth data.

Consider tuning how much reference spike-in we really need to optimise, or add a limit. Eg "if (rfract > 0.2) rfract = 0.2" will gain the bulk of the benefit without so many edge cases?

above: double rfract = (r - t2).75 / (r+1); try50: double rfract = (r - t2).50 / (r+1);

Change going FROM 75 TO 50 (chr1 10-20M).

bgi 10x: slight increase in GT error. Similar otherwise
bgi 30x: slight reduction in FP, a couple fewer FN
illumina 15x: very slight increase in GT
illumina 30x: almost identical

Not enough evidence to change yet.

jkbonfield commented 7 months ago

TEST 3

TEST 3: Compute indelQ in two ways (vs ref and vs next-best) and blend.

30x Illumina

 With           :    Without
 FP    GT    FN :    FP    GT    FN
 24    31    49 :
 24    31    50 :    22    31    51
 19    31   101 :    18    31   102
 13    30   153 :    12    30   153
 13    30   209 :    12    29   201
 12    30   254 :    11    29   251
 12    30   309 :    11    28   304
 12    30   355 :    11    28   353
 10    30   403 :     9    28   407

Helps sensitivity marginally, but also marginal gain in FP.

10x Illumina

 With           :    Without
 FP    GT    FN :    FP    GT    FN
 43    67   148 :    44    68   147
 39    67   152 :    40    68   151
 16    63   203 :    18    62   200
 14    59   252 :    15    60   253
 12    57   307 :    13    59   300
 10    56   350 :    11    57   352
  8    51   402 :     9    51   401
  7    47   455 :     8    48   452
  6    45   504 :     7    46   502

Small reduction to FP on shallow data, maybe marginal GT benefit too.

10x BGI

 With           :    Without
 FP    GT    FN :    FP    GT    FN
247   131   303 :   248   133   303
114   127   350 :   117   130   350
 81   121   404 :    83   123   405
 60   120   450 :    63   122   451
 36   116   500 :    40   118   505

Slight beneficial reduction in FP and GT.

53x PacBio

 With           :    Without
 FP    GT    FN :    FP    GT    FN
216    74    48 :   214    75    49
175    74    52 :   199    75    50
 48    73   102 :    48    74   106
 39    70   160 :    39    72   152
 34    70   202 :    34    71   207
 31    65   256 :    31    67   258
 31    62   305 :    31    63   309

Significant reduction in low-qual FP, but almost nil impact elsewhere bar the most minuscule GT change.

15x PacBio

 With           :    Without
 FP    GT    FN :    FP    GT    FN
554   140   110 :   545   142   114
306   136   159 :   321   139   154
212   135   200 :   220   137   200
128   132   250 :   137   135   250
 99   127   308 :   102   130   305
 84   126   351 :    84   128   353
 75   121   400 :    74   123   410

A slight reduction in FP rates, and slighter still to GT. Also a slight improvement in Q0 sensitivity.

Examples of low-coverage changes:

False postives CAUSED by the heuristic:

    chr1    10018582        CT      C       4.37617 0/1
    chr1    11799633        GT      G       5.56825 0/1
    chr1    12201471        TA      T       4.75713 0/1
    chr1    12580996        TA      T       12.4837 0/1
    chr1    14615625        CA      C       4.86644 0/1
    chr1    17445094        CA      C       4.40545 0/1
    chr1    17479481        CA      C       17.9461 0/1
    chr1    19200268        TA      T       6.58196 0/1
    chr1    19944115        AT      A       11.603  0/1

False postives CURED by the heuristic:

    None

It appears that heuristic is detrimental, and indeed it makes mistakes. We see this in QUAL 0 FP being 554 vs 545. However they are all low quality.

chr1:17479481 is highest qual here. It's a poly-A with counts 13A=2, 12A=9, 11A=5. Indels-2.0 and dev both call it too (qual 37 and 5 respectively).

Checking the COMMON FPs, with the two last columns being QUAL (with, without):

1       chr1    14146180        G       GT      0/1     6       5
2       chr1    15556125        CT      C       0/1     22      19
2       chr1    18270779        C       CA      0/1     22      20
3       chr1    15805897        CT      C       0/1     7       4
4       chr1    11464319        ATT     A       0/1     113     108
4       chr1    19710270        ATT     A       0/1     84      79
6       chr1    10967155        GAAAA   G       0/1     148     142
6       chr1    13666839        CT      C       0/1     11      4
11      chr1    10500241        TA      T       0/1     21      10
11      chr1    11196849        TA      T       0/1     15      4
11      chr1    11331008        TA      T       0/1     15      4
14      chr1    11115059        TA      T       0/1     33      19
21      chr1    16084815        ACC     A       0/1     148     126
22      chr1    12483996        CA      C       0/1     37      15
23      chr1    15847116        TA      T       0/1     30      7
26      chr1    12068876        CT      C       0/1     66      40

Sorted by delta of QUAL-with and QUAL-without. There are none where the delta is negative, so 16 of 545 are higher QUAL with heuristic than without. Ie FP is stronger.

Again, this looks detrimental. Particularly those with QUAL delta of 20 or more. chr1:12068876 is long poly T. 14T=2, 13T=8, 12T=8.

Looking at true variants, I see 1974 common calls. 105 have higher QUAL with the heuristic (max delta +40) and 8 with QUAL lower (maximum delta is -4).

QUAL delta on COMMON TP, >= 20.

20      chr1    17748405        CT      C       0/1     72      51
21      chr1    10566493        CT      C       0/1     94      72
21      chr1    16084815        AC      A       1/0     148     126
21      chr1    19138672        TA      T       0/1     84      62
23      chr1    11713198        CA      C       0/1     55      31
23      chr1    18424845        CA      C       0/1     53      30
24      chr1    14595143        CA      C       1/1     158     134
30      chr1    16096322        CA      C       1/1     115     84
30      chr1    16333436        AC      A       1/1     138     108
30      chr1    18611271        GAGGAAGGAAGGA   G       1/0     96      65
40      chr1    15850400        GT      G       1/1     135     95

So the heuristic benefit is primarily one of discrimination which is hard to demonstrate with single examples. It's just a numbers game.

This isn't surprising giving the heuristic isn't really changing calls (mainly affected by seqQ), but is instead changing their qualities.

Conclusion, a small benefit overall

jkbonfield commented 7 months ago

TEST 6

Depth based seqQ cap in bam2bcf.c

15x PacBio

 With           |    Without
 FP    GT    FN |    FP    GT    FN
554   140   110 |   791   148   103
306   136   159 |   483   144   150
212   135   200 |   330   142   203
128   132   250 |   231   140   252
 99   127   308 |   178   135   306
 84   126   351 |   154   129   359
 75   121   400 |   129   124   406

53x PacBio

 With           |    Without
 FP    GT    FN |    FP    GT    FN
216    74    48 |   686   138    37
175    74    52 |   243   137    50
 48    73   102 |   146   128   100
 39    70   160 |   120   120   153
 34    70   202 |   110   110   201
 31    65   256 |   102    99   255
 31    62   305 |    98    92   302

Huge reductions in FP rates, and for deeper data also for GT assignment errors. This is instrumental in improving calling accuracy, at least of diploid data.

30x Illumina

 With           |    Without
 FP    GT    FN |    FP    GT    FN
                |    52    37    42
 22    31    51 |    38    37    50
 18    31   102 |    27    35   103
 12    30   153 |    22    33   150
 12    29   201 |    21    33   201
 11    29   251 |    20    31   251
 11    28   304 |    17    26   302
 11    28   353 |    17    22   350
  9    28   407 |    13    20   405

Significant FP reduction

100x BGI

 With           |    Without
 FP    GT    FN |    FP    GT    FN
 17    11   121 |   110    70   120
  9     9   150 |    38    64   151
  8     8   200 |    30    56   239
  6     8   252 |    14    40   269
  6     7   306 |  
  6     6   355 |    11    37   378
  5     5   400 |    11    33   404

Similar sensitivity, but huge FP and GT error reduction.

Without this rule it's also highly overvonfident in QUAL. It's not calibrated, but Q170 is 1 in 10^17 chance of an error, yet we have 38 out of 1921 => Q17.

BGI 100x analysis:

False postives CAUSED by the heuristic, sorted by qual:

none

False postives CURED by the heuristic, sorted by qual:

11263494        220.12  TAC     T       0/1 54,12   no variant
10090793        220.041 TTG     T       0/1 86,17   no variant
11192748        195.274 TAA     T       0/1 2,19    truth TA   T
11379137        195.08  CAA     C       0/1 1,16    truth CA   C
11782918        194.834 CAA     C       0/1 11,24   truth CA   C
11265500        194.728 ATTT    A       0/1 4,17    truth ATT  A
11140983        194.719 G       GAA     0/1 6,17    truth G    GA
14193661        194.626 GTT     G       0/1 1,16    truth GT   G
19464020        194.577 CAA     C       0/1 3,12    truth CA   C
11618673        194.436 CAA     C       0/1 8,11    truth CA   C
11398727        194.416 A       ATT     0/1 14,11   truth A    AT
10523628        194.257 TAA     T       0/1 3,16    truth TA   T
16183133        194.201 CAA     C       0/1 10,15   truth CA   C
17486147        193.968 C       CAA     0/1 3,8     truth C    CA
18864334        193.947 CTT     C       0/1 10,12   truth CT   C
11603885        193.819 C       CAAA    0/1 6,17    truth C    CAA
15284478        193.489 C       CAAA    0/1 5,13    truth C    CAA
19820846        193.333 TAA     T       0/1 6,15    truth TA   T
14933849        193.113 CTT     C       0/1 4,9     truth CT   C
10514156        191.236 C       CTG     0/1 0,11    truth C    CTGTG
10918938        191.088 ATTTT   A       0/1 0,17    truth ATTT A
10454856        190.293 CTTTT   C       0/1 0,12    truth CTTT C
16187322        189.738 TAAAA   T       0/1 0,13    truth TAAA T
10967155        189.57  GAA     G       0/1 0,7     truth GAAA G
15833706        189.247 TA      T       0/1 0,12    truth TAA  T
12192276        186.096 CTG     C       0/1 70,11   no variant
10984598        117.13  C       CT      0/1 67,15   no variant
15867712        107.302 CT      C       0/1 82,19   no variant
19699486        103.187 CAT     C       0/1 69,10   no variant
...
(93 entries)

chr1:11263494 (no var) is 12 -AC out of 91, so could be justified as a variant, but with such deep data we expect a more reasonable allele rate so the likelihood of it being true is decreased (which is achieved by capping seqQ per read). Similarly chr1:10090793.

The others are called as GT 1/2 when in reality it's 0/1 or 1/1. (We 0/1 above due to separating them for purposes of allele counting.)

Eg chr1:11398727 is identified as 14 REF, 36 +1T, 11 +2T. We call 0/1 with heuristic or 1/2 without. Truth is 1/1 and the few REF matches as erroneous.

Comparison of those with indels-cns (with heuristic), indels-2.0 and develop shows the new heuristic works well.

          Truth                 Indels-cns      indels-2.0      dev
10090793                                        FP 1/2          FP 0/1
10454856  CTTT  C       1/1     Y               1/2             1/2
10514156  C     CTGTG   1/1     Y               1/2             1/2
10523628  TA    T       1/1     Y               1/2             1/2
10918938  ATTT  A       1/1     Y               1/2             1/2
10967155  GAAA  G       1/1     Y               1/2             1/2
11140983  G     GA      1/1     Y               1/2             1/2
11192748  TA    T       1/1     Y               1/2             1/2
11263494                                        FP 0/1          1/2
11265500  ATT   A       1/1     Y               1/2             1/2
11379137  CA    C       1/1     Y               1/2             1/2
11398727  A     AT      1/1     0/1             1/2             1/2
11603885  C     CAA     1/1     Y               1/2             1/2
11618673  CA    C       1/1     Y               1/2             1/2
11782918  CA    C       1/1     Y               1/2             1/2
12192276                                        FP 0/1          1/2
14193661  GT    G       1/1     Y               1/2             1/2
14933849  CT    C       1/1     Y               1/2             Y
15284478  C     CAA     1/1     Y               1/2             1/2
15833706  TAA   T       1/1     Y               1/2             1/2
15867712                                        FP 1/2          1/2
16183133  CA    C       1/1     Y               1/2             1/2
16187322  TAAA  T       1/1     Y               1/2             1/2
17486147  C     CA      1/1     Y               1/2             1/2
18864334  CT    C       1/1     Y               1/2             1/2
19464020  CA    C       1/1     Y               1/2             1/2
19699486                                        FP 0/1
19820846  TA    T       1/1     Y               1/2             Y

False positives SHARED by both with/without heuristic (with QUAL scores)

10523825    162.015 TAAA    TA,T     1/2 0,26      truth TAA T 1/1
17775360    108.287 TGGA    T        0/1 44,63     diff representation
12536917    103.504 C       CTT,CTTT 0/1 6,12      truth C CTT 1/1
17967242    94.4016 G       GATTC    0/1 37,37     diff representation
...

chr1:10523825 vs ref, I see -4A=9 -3A=19 -2A=66. Seems unlikely we'd not call 1/2 (and also called by indels-2.0 and devel). Just hard judgement in the data. Can get a correct call with "-o 10 -e 1", but that'd leads to false negatives elsewhere (tested as 540 at Q0).

False negatives CAUSED by the heuristic, sorted by qual:

16060510    50      C       CT      0|1 23,29

False negatives CAUSED by the heuristic, sorted by qual:

none

However each platform needs differing seqQ cap as the biases vary. It's reliant on the amount of systematic bias, so how much we can trust depth. There may be some over-fitting to specific runs here, as I've only tested one run per platform (sometimes all I have available for a known truth set), but I tuned per platform and hope the bulk of the variation is platform specific rather than run specific. We may need to retune in future.

It would probably also harm calling of cancer data where we're not expecting hardy-weinberg stats, but they need their own dedicated caller.

jkbonfield commented 7 months ago

All of the data files here are public GIAB files. The HG002 ones are downloaded locally in Sanger at ~jkb/lustre/GIAB/chr1/

Files: HG002.GRCh38.15x.RG.bam HG002.GRCh38.30x.RG.bam HG002.GRCh38.60x.RG.bam HG002.SequelII.merged_15kb_20kb.pbmm2.GRCh38.haplotag.10x.bam pb_new.15x.bam (subsample of above) BGI.100x.bam BGI.30x.bam BGI.10x.bam

Plus in the ../ont and ../ultima directories too. The number of examples potential for being created to add to your evaluation set is considerable, but I'll let you have an explore as to which you think are best. I just gave an example summary above based on bcftools isec to find new-errors created or old-errors fixed.

jkbonfield commented 7 months ago

To answer the suggestion that I've been overfitting the data by tuning each set of parameters so much, I took the GIAB HG005 (Chinese Trio) sample instead. This isn't ideal as it's less well studied so the truth set may have more errors, and there aren't so many machine runs uploaded, but it's sufficient to get an idea of how it performs.

First up BGI:

15x

It's not ideal, as we're behind devel on the FP vs FN curve, but the gap closes once we start filtering out low-QUAL calls. Never quite passes it though. Genotype accuracy (not plotting as it's almost a straight line as all genotype assignment errors are obviously from high quality calls that don't get filtered) is better though - ballpark down from 420 errors to 360 errors.

130x

We're around about 3x fewer false positives for the same recall rate, plus our unfiltered recall is considerably higher. Indels-2.0 is performing really badly on this data set.

As before, GT accuracy is just vertical lines in the 0-3% FN bracket. At Q0 it's dev=301, indels-2.0=333, indels-cns=13. A colossal improvement in genotype assignment accuracy.

jkbonfield commented 7 months ago

Next up, Illumina HiSeq. It's not the most modern data, but I haven't yet found any for HG005. I'll have a look next week maybe.

15x

Indels-cns has a considerable boost to recall rate, while broadly speaking following the FP-vs-FN trend of indels-2.0 on the shallow data once we start to filter by QUAL. As before, we're a bit behind on genotype assignment accuracy for shallow data, with QUAL0 being dev=146 err, indels-2.0=284, and indels-cns=186. So not hugely behind, but a bit.

50x

At 50x we've got enough data to start making reasonable calls for both devel and indels-cns, although once again indels-2.0 is showing itself to be a poor performer with regards to false-positives. Indels-cns is the most sensitive method. With the deeper data however it's now also the most accurate at genotype assignment, with Q0 error counts being dev=120, indels-2.0=204, indels-cns=23. So 4-5x fewer errors there.

300x

The full data set is overly deep. I have --indels-2.0 still running, but it's likely to end up in the same position. Both dev and indels-cns understandably do a good job, but indels-cns is still the clear winner. It's now also only making 4 genotype assignment errors, compared to 57 for develop.

jkbonfield commented 7 months ago

Now for PacBio HiFi data. Note on the long read data, I find --indels-2.0 always crashes, so the results are missing.

15x Both algorithms are really error prone and call lots of false positives. Genotype assignment error is very similar between the two, at about 6% error rate.

Nonetheless, indels-cns has a much better sensitivity for any given FP rate (or vice versa).

50x (Sorry for typo in setting plot title - it is 50x, not 10x)

This is now looking like chalk vs cheese! The FN rate of devel is really poor, and misses 3x as many indels as --indels-cns. The false positive rate is more respectable, but again indels-cns is considerably ahead. To get to ~160 false negatives (where dev starts at Q0), indels-cns is making around 10x fewer false positives, mainly because it started the curve so much lower down.

As we saw before once we get deeper data, genotype assignment is now considerably improved over develop, dropping from approx 640 errors down to 240.

jkbonfield commented 7 months ago

Conclusion: I hope this addresses your concerns about over-fitting. I haven't demonstrated it on ONT (I'm not sure if modern "super" basecaller data is available for HG002), but it looked promising with HG002.

This algorithm is considerably better at assigning genotype, and the recall vs precision graph is strongly in favour of the new algorithm. Sadly it's strongly demonstrating that indels-2.0 is suffering, so I would also recommend (in a separate PR) marking it as deprecated once this PR is merged.

Files for the HG005 analysis are locally in ~jkb/lustre/GIAB/HG005. That includes BAM files, VCFs, and plots. You should be able to do bcftools isec on any of the dev vs "edlib" (this PR) VCFs to generate example new-FP, new-FN, old-FP and old-FN corpuses for your test system from there.

jkbonfield commented 7 months ago

ONT data too. This is terrible at indel calling with bcftools as it was old data with an old base-caller, so quite poor on error rates, especially indels. Modern ONT base-callers are far superior, but I don't have any data for HG005.

It's pretty ropey with a huge FP rate (vs about 5% on the "Super" mode for HG002 above). However it's clearly more sensitive and at all filtering levels offering a better FN vs FP tradeoff than develop. Likewise the genotype error rate is significantly reduced at all qual levels (for a given FNr).

jkbonfield commented 7 months ago

Some newer data - HG005 Ultima Genomics. Unlike the HG002 data this was quite deep.

59x Ultima

This data came to me with DeepVariant and GATK HaplotypeCaller joint results. Surprisingly bcftools is doing really well here compared to others on recall, although our genotype assignment accuracy is lagging. This PR extends the lead further.

*15x Ultima

As seen on other instruments, we're a bit behind develop on accuracy of genotype assignments for shallow data, but it's more than made up by the improved recall overall.

jkbonfield commented 7 months ago

I've moved this out of draft as I'm now happy with it to be reviewed.

Hopefully this demonstrates the gains to be had. I've validated it now makes zero difference to the default usage of bcftools.

Note however it does change the non-versioned profiles, eg -X illumina now uses -X illumina-1.20 which is the newer variant caller. This could be changed, but the parameters in the profile have been tuned with the new caller in mind, so using -X illumina-1.20 --no-indels-cns won't give the same results, and it may not be better than straight off -X illumina-1.18 either.

In my mind, we should always make sure the non-version numbered profiles use the best options available as that's what most people are going to do. It's only people who want continuity between releases, upgrading purely for major bug-fixes but not accepting anything that changes output, that should consider using a release-specific parameter set. (To this end, the -1.20 also contain a bug fix to IDV and IMF INFO fields which could go into 1.18, but I'm taking the request to not change old profiles literally here.)

pd3 commented 7 months ago

In my mind, we should always make sure the non-version numbered profiles use the best options available as that's what most people are going to do.

A major revamp like this needs to be tested by more people and on more data sets, before it becomes the new default.

To this end, the -1.20 also contain a bug fix to IDV and IMF INFO fields which could go into 1.18, but I'm taking the request to not change old profiles literally here

Bug fixes are a different thing. Not sure why we wouldn't want to include it.

jkbonfield commented 7 months ago

In my mind, we should always make sure the non-version numbered profiles use the best options available as that's what most people are going to do.

A major revamp like this needs to be tested by more people and on more data sets, before it becomes the new default.

Understood, but what's the best naming? We discussed previously changes and you made it clear that irrespective of testing this could never become the default as people don't expect changing results.

However I took the approach of bcftools without command line arguments = default. Bcftools with specified profiles could be different. For consistency, I felt that people who are using a profile and have a hard requirement that the output never changes, should have a profile named after a bcftools release (eg illumina-1.18) while people who do not required hard never-changing results could use a non-release tagged profile that upgrades automatically to what we personally believe represents the best caller. That feels like a good pragmatic solution and what I've done here.

Do you agree?

If so I guess the final question is what do we do about new profiles that are undergoing testing? Should I just rename them to 1.21, so there is a clear statement that it's experimental for this release and (if no problems found) will become the new default next time around? That feels like good signalling.

To this end, the -1.20 also contain a bug fix to IDV and IMF INFO fields which could go into 1.18, but I'm taking the request to not change old profiles literally here

Bug fixes are a different thing. Not sure why we wouldn't want to include it.

Well, because of the above statement basically. Also when is a bug a bug?

My fix I'm thinking of is IDV and IMF are computed based on the number of records containing the indel called originally. Not the number called after we validated them. We've always had this as bcftools has always realigned reads (previously to patched ref, now to consensus), but it used the wrong data when filling out INFO. We could argue over bug or feature, and I can see both sides, which is why I chose to not fix it.

However it comes down to the same issue. If we take the route that people should never be getting changed results, which was your firm belief before and why you rejected the previous PR, then we also shouldn't be making such changes without producing a new version-numbered profile.

pd3 commented 7 months ago

My apologies, I completely misunderstood your words. I agree with what we agreed before:

profiles like -X illumina are what we think is best at the moment
profiles like -X illumina-1.18 remain stable forever, possibly with glaring bugs fixed
major changes are added as a new option at first. After a few releases they can become the new default.

It was not my brightest moment, sorry again.

jkbonfield commented 7 months ago

No problem. It's a new direction and not something we've done before. I'm not sure it's perfect, but it feels like a reasonable strategy.

The question still remains though as to when the non-versioned profile switches. If this is present in bcftools 1.20 but we don't want "bcftools mpileup -X illumina" then we need a different profile name than "illumina-1.20". I'm open to suggestions, but maybe illumina-1.21 or illumina-experimental? The problem with the latter is we'd want to delete it after it becomes non-experimental.

New options is a tricky one too. The profile configuration and the option are inherently entangled. We could obviously do -X illumina-1.18 --indels-cns, but it wouldn't be as good as -X illumina-1.20 as the former is tuned around BAQ scores while the latter is tuned around edlib + consensus scores. (Although -X illumina-1.18 --indels-cns should still be an improvement).

Edit: actually that's a really bad example as there were no changes made to Illumina! However PacBio for example gained a lot of new parameters, such as --poly-mqual and --del-bias, as well as other parameters (-h) being preferable with the different reference sequence construction and alignment scoring algorithm.

I don't have --indels-cns benchmarked using the old profile, but I do have the old variant called (bam2bcf_indel.c) with new and old parameters, minus the --indels-cns bit:

pacbio 50x                                                                      
 | dev-1.18          |  dev 1.20          |  indels-cns                         
 |    FP    GT    FN |    FP    GT    FN  |    FP    GT    FN                   
-+-------------------+--------------------+------------------                   
 |                   |                    |  1721   240    51                   
 |                   |                    |   576   239   106                   
 |  3569   647   159 |                    |   381   238   157                   
 |  1804   641   200 |                    |   315   238   201                   
 |  1175   635   257 |  3186   619   276  |   254   238   270                   
 |   898   625   300 |  2125   617   300  |   231   238   312                   
 |   751   617   350 |  1177   613   352  |   210   238   353                   
 |   661   607   412 |   858   606   401  |   195   238   413                   
 |   615   592   464 |   710   601   456  |   178   238   470                   
 |   566   585   519 |   601   584   515  |   170   238   509

So we see using -X pacbio-ccs-1.20 --no-indels-cns over -X pacbio-ccs-1.18 makes the indel caller less sensitive and with a higher FP rate for any given FN rate. In short, considerably worse. However combined with --indels-cns it becomes considerably better. If you're interested I could try -X pacbio-ccs-1.18 --indels-cns combined to show how that worked too, to get the 4th combination.

Edit: I ran HG002 chr20 with -X pacbio-ccs-1.18 --indels-cns now and see this:

    FP    GT    FN
  2535   285    42 
  1210   285    50 
   553   284   102 
   431   283   153 
   345   283   202 
   308   283   253 
   280   283   302 
   248   283   368 
   232   283   424 
   218   283   456

So at ~50FN it's fewer FP, but if we do QUAL filtering to get approx comparable FP and FN (maximising discrimination and F1 score) we're around 20% more errors. But that's still a huge improvement over the ~double error of develop, so just adding --indels-cns to the old profile is still viable in this single test, just not optimal.