Open jkbonfield opened 2 months ago
True, the option --indels-2.0
is far from finished, but it is not intended for deprecation.
I'll use the opportunity to document its current state and explain its main idea: it is intended to replace the old mpileup indel calling code and make it modular, so that new concepts can be plugged in and tested more easily. It also changes the overall structure of the algorithm, so that nested loops naturally support diploid or even polyploid genomes
// find all indel types at the tested position
indel_types = find_candidate_indels()
for each sample in samples do
// create one, two, or more consensus templates, depending on genome ploidy
cons = create_consensus(sample)
for each template,indel_type in cons,indel_types do
// apply each indel type on top of the consensus templates
ref = create_reference(template,indel_type)
// align reads
for each read in reads do align_reads(ref, read)
// score reads and calculate per-read indel quality
score_reads(reads);
// find site-level indels, trim unused indel candidates
evaluate_indels()
Currently the algorithm of consensus template creation is as follow:
Apart from small technical problems, such as the assert you're quoting, the consensus building algorithm described above is not perfect and should be improved.
Understood, buit please think of this from a user perspective:
$ bcftools mpileup
...
--indel-size INT Approximate maximum indel size considered [110]
--indels-2.0 New EXPERIMENTAL indel calling model (diploid reference consensus)
--indels-cns New EXPERIMENTAL indel calling model with edlib
--seqq-offset Indel-cns tuning for indel seq-qual scores [120]
...
There's no way for the user to see that the current caller "works well", the 2.0 caller "works less well" and the indels-cns caller "works better". That's in generalised terms of course as obviously all implementations will have some cases they get better and some they get worse.
I'm not arguing against doing more work nor that it is necessary, but that it is exposed to the users and advertised before it's in a viable state to be used. Maybe it simply needs removing from the usage statement. If so I'd also change the "with edlib" to "with diploid consensus". I only added the edlib there because I couldn't really use the term I wanted to as it was already being used.
Alternatively, drop the EXPERIMENTAL from indels-cns. It's well tested and debugged and is in a different ballpark to indels-2.0 for robustness and polish. Clearly it's not perfect (nothing is, including the original caller), but as far as I'm concerned it's a finished project and I have no intention of further development of that algorithm.
I am happy for the wording to be changed
Indels-2.0 is somewhat buggy at present. On several tests it simply fails. For example the test data from https://github.com/samtools/bcftools/issues/2277.
Often this is due to the use of asserts to validate assumptions. Those validations are perhaps fine, but assert is not. Either it needs the asserts replacing by return codes to indicate errors, so that the alignment is rejected, or it needs canning as functionality. Basically it looks to be part-way through development and more of a prototype than a usable piece of code, although it is labelled as experimental to reflect that.
If I recall it was an attempt to rewrite my old PR (https://github.com/samtools/bcftools/pull/1679), but in doing so introduced bugs and lost most of the improvements in calling accuracy. Ultimately I improved on that PR and it got merged as
--indels-cns
, so we may need to decide what to do with the--indels-2.0
functionality.Right now though I think it's just confusing for the users having multiple experimental options.