samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
680 stars 240 forks source link

Document / Improve / Polish / Fix --indels-2.0 #2280

Open jkbonfield opened 2 months ago

jkbonfield commented 2 months ago

Indels-2.0 is somewhat buggy at present. On several tests it simply fails. For example the test data from https://github.com/samtools/bcftools/issues/2277.

bcftools: bam2bcf_iaux.c:468: iaux_align_read: Assertion `qry_off1<=qry_off2' failed.

Often this is due to the use of asserts to validate assumptions. Those validations are perhaps fine, but assert is not. Either it needs the asserts replacing by return codes to indicate errors, so that the alignment is rejected, or it needs canning as functionality. Basically it looks to be part-way through development and more of a prototype than a usable piece of code, although it is labelled as experimental to reflect that.

If I recall it was an attempt to rewrite my old PR (https://github.com/samtools/bcftools/pull/1679), but in doing so introduced bugs and lost most of the improvements in calling accuracy. Ultimately I improved on that PR and it got merged as --indels-cns, so we may need to decide what to do with the --indels-2.0 functionality.

Right now though I think it's just confusing for the users having multiple experimental options.

pd3 commented 1 month ago

True, the option --indels-2.0 is far from finished, but it is not intended for deprecation.

I'll use the opportunity to document its current state and explain its main idea: it is intended to replace the old mpileup indel calling code and make it modular, so that new concepts can be plugged in and tested more easily. It also changes the overall structure of the algorithm, so that nested loops naturally support diploid or even polyploid genomes

// find all indel types at the tested position
indel_types = find_candidate_indels()

for each sample in samples do
      // create one, two, or more consensus templates, depending on genome ploidy
      cons = create_consensus(sample)

      for each template,indel_type in cons,indel_types do
            // apply each indel type on top of the consensus templates
            ref = create_reference(template,indel_type)

            // align reads
            for each read in reads do align_reads(ref, read)

            // score reads and calculate per-read indel quality
            score_reads(reads);

// find site-level indels, trim unused indel candidates
evaluate_indels()

Currently the algorithm of consensus template creation is as follow:

  1. Identify heterozygous variant positions in the realignment window across all reads
  2. Sort these candidate variants by abs(variant_allele_freq-0.5) in descending order (i.e. the closer to 0.5, the more likely will these become a differentiating element of the two consensus template sequences)
  3. Take the top sorted variants (up to 8 to fit in uint8_t) and count the number of reads that support these haplotypes (i.e. contain these variants in various combinations). This creates a haplotype frequency spectrum.
  4. Correct errors by collapsing spurious haplotypes to obtain just two consensus template sequences
  5. Any other mismatches are incorporated into both versions of the consensus template sequences by a majority vote

Apart from small technical problems, such as the assert you're quoting, the consensus building algorithm described above is not perfect and should be improved.

jkbonfield commented 1 month ago

Understood, buit please think of this from a user perspective:

$ bcftools mpileup
...
      --indel-size INT    Approximate maximum indel size considered [110]
      --indels-2.0        New EXPERIMENTAL indel calling model (diploid reference consensus)
      --indels-cns        New EXPERIMENTAL indel calling model with edlib
      --seqq-offset       Indel-cns tuning for indel seq-qual scores [120]
...

There's no way for the user to see that the current caller "works well", the 2.0 caller "works less well" and the indels-cns caller "works better". That's in generalised terms of course as obviously all implementations will have some cases they get better and some they get worse.

I'm not arguing against doing more work nor that it is necessary, but that it is exposed to the users and advertised before it's in a viable state to be used. Maybe it simply needs removing from the usage statement. If so I'd also change the "with edlib" to "with diploid consensus". I only added the edlib there because I couldn't really use the term I wanted to as it was already being used.

Alternatively, drop the EXPERIMENTAL from indels-cns. It's well tested and debugged and is in a different ballpark to indels-2.0 for robustness and polish. Clearly it's not perfect (nothing is, including the original caller), but as far as I'm concerned it's a finished project and I have no intention of further development of that algorithm.

pd3 commented 1 month ago

I am happy for the wording to be changed