samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

regression of mpileup performance on Nanopore #1584

Open · mbhall88 opened 2 years ago

mbhall88 commented 2 years ago

I have seen some differences in recall and precision on Nanopore data following the big changes to the mpileup command in v1.13.

The first thing to say though is v1.13 is a massive improvement in terms of both CPU and memory usage. :fire:

I'm not sure of the best place to start this discussion so I'll start with a vague description of the scenario and I can provide more details for what you feel is relevant.

I am performing some SNP calling precision and recall analysis on Mycobacterium tuberculosis samples for which I have both Illumina and Nanopore data. For seven of these samples I also have PacBio CCS, so I am using the CCS assemblies as truth genomes and masking spurious locations in further analysis.

I had fine-tuned some filters for version <=1.12 that provided precision on par with Illumina and recall a few percentage points lower than Illumina. However, these filters fail miserably for v1.13 - recall drops off a cliff, but precision stays good.
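To illustrate the kind of filtering I mean (the expression and thresholds below are hypothetical placeholders, not my actual tuned values):

```shell
# Hypothetical example of post-call filtering; the tags and thresholds
# here are illustrative only, not the actual tuned filters.
bcftools filter \
    -e 'QUAL < 25 || FORMAT/DP < 5 || MQ < 30' \
    -s lowqual \
    calls.vcf.gz -Oz -o calls.filtered.vcf.gz
```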

In addition to the seven samples with CCS assemblies, I have another 143 samples with Illumina and Nanopore data. My proxy for comparing these is pairwise SNP distance between samples for both technologies. The distance between a pair of samples should be the same on both technologies, so any differences are technology-driven. With v1.13 I am seeing a huge drop in the pairwise distance between samples on Nanopore - i.e., recall is very bad.

I have tried playing around with almost all of the mpileup parameters I think might be useful - full BAQ, partial BAQ, the nanopore config, and basically all of the genotype likelihood parameters. I can't find a set of filters that provides a good balance between precision and recall for v1.13.
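The sorts of invocations I have been sweeping over look roughly like this (a sketch only; file names are placeholders, and the exact flag names, e.g. -D/--full-BAQ and the -X presets, should be checked against `bcftools mpileup --help` for your build):

```shell
REF=ref.fa          # placeholder reference
BAM=sample.bam      # placeholder Nanopore alignment

# No BAQ at all
bcftools mpileup -f "$REF" -B "$BAM" | bcftools call -mv -Oz -o nobaq.vcf.gz

# Full BAQ everywhere (closer to the pre-1.13 behaviour)
bcftools mpileup -f "$REF" -D "$BAM" | bcftools call -mv -Oz -o fullbaq.vcf.gz

# The Nanopore preset
bcftools mpileup -f "$REF" -X ont "$BAM" | bcftools call -mv -Oz -o ont.vcf.gz
```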

Normally I would just ignore this and stick with v1.12. But I figured this is likely helpful for others, and v1.12 uses an insane amount of memory and takes a few hours, compared to v1.13 which uses nearly no memory and CPU time comparatively.

Apologies for the rambling nature of this issue. I'm happy to provide plots, some data, try out some parameters etc.

jkbonfield commented 2 years ago

It's almost certainly one of the many things that went on in this PR:

https://github.com/samtools/bcftools/pull/1474

My own tests showed the opposite; an improvement to long-read calling. Data would certainly help in diagnosing the issue.

Does adding --config 1.12 cure it? This changes most things back to how they were. However, there are some changes which aren't configurable. Most of these were in the formulae used for computing genotypes. The filtering expressions are definitely different too, with a more GATK-like Z score being used instead of the probability scores. The Z-score is logged, which makes it much easier to filter on than an asymptotic number very close to 0 or 1; however, it will change how existing filters behave. Assuming my commit logs are correct, the -U option will re-enable the old filter scale.
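Concretely, the two suggestions above look something like this (a sketch; file names are placeholders, and --config/-X and -U are as described above):

```shell
# Revert most of the 1.13 mpileup changes back to their 1.12 behaviour
bcftools mpileup -f ref.fa -X 1.12 sample.bam | bcftools call -mv -Ob -o calls.bcf

# Keep the new defaults but re-enable the old (pre-Z-score) probability
# scale for the filtering annotations
bcftools mpileup -f ref.fa -U sample.bam | bcftools call -mv -Ob -o calls.bcf
```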

jkbonfield commented 2 years ago

Also for 1.12, one thing you could do for speeding it up is turning off BAQ (-B) and turning off indel calling (-I). Indel calling also internally uses BAQ and the -B option does not disable this. BAQ is glacial on long reads, to the extent of being virtually unusable, plus as you've found it's also excessively memory hungry.
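Putting those two flags together for a 1.12 run (file names are placeholders):

```shell
# bcftools 1.12: disable BAQ (-B) and indel calling (-I) to cut CPU and
# memory use. Both are needed: indel calling runs BAQ internally, so -B
# alone does not disable it.
bcftools mpileup -f ref.fa -B -I sample.bam | bcftools call -mv -Ob -o snps.bcf
```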

I'm curious to know if 1.12 without BAQ/Indels still has the problem. If so, then it implies this is a fundamentally hard problem to solve as it's BAQ which is rescuing the SNPs. I did have ideas of doing a sub-BAQ, to call it on small fragments instead, but I can't recall if that made it in or not now.

jkbonfield commented 2 years ago

Note you may also find the instructions at https://jkbonfield.github.io/www.htslib.org/workflow-filter/ of use for identifying the optimal filters in the most recent mpileup. Although this does require a truth set so you know what the correct and incorrect calls are.

Edit: to @pd3, @daviesrob - this PR appears to have never been merged. Was there a reason why the documentation didn't make it to the website?

mbhall88 commented 2 years ago

Does adding --config 1.12 cure it?

No - neither this nor -U provides results in line with what I was getting previously; they're actually closer to the v1.13 results than the v1.12 ones.

Also for 1.12, one thing you could do for speeding it up is turning off BAQ (-B) and turning off indel calling (-I).

I am already using -I on v1.12. I found -B hurts recall (as expected) but gives much better CPU and memory usage, as you say. I suspect there isn't much that can be done about this, but I thought it worth at least documenting somewhere.

I will take a look at that filtering document, it looks very thorough! I'll try that out and get back to you.

pd3 commented 2 years ago

@jkbonfield I was not aware there is a documentation pull request. It is marked as a work in progress, could that be the reason why it's not merged yet?

mbhall88 commented 2 years ago

Ok, I have tested a pretty exhaustive set of parameters and can't quite recreate the v1.12 results on my dataset. However, despite the slight reduction in precision and recall, the improvements in CPU and memory usage are enough for me to take these losses.

jkbonfield commented 2 years ago

Do you have any specific examples that it found before and doesn't now? You can use "bcftools isec" (maybe needing multiple combinations) to see things that were found now and not before or vice versa, possibly then intersected a second time with a truth set if you have one.
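A sketch of the isec approach (file names are placeholders; inputs must be bgzipped and indexed):

```shell
# Sites private to each call set; -Oz makes the outputs bgzipped so they
# can be fed into a second intersection
bcftools isec -p isec_out -Oz v1.12.vcf.gz v1.13.vcf.gz
# isec_out/0000.vcf.gz  records private to v1.12 (lost in v1.13)
# isec_out/0001.vcf.gz  records private to v1.13 (new calls)

# Optionally intersect the "lost" set with a truth VCF
bcftools isec -p lost_vs_truth -Oz isec_out/0000.vcf.gz truth.vcf.gz
```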

There may be things we can further tweak.

mbhall88 commented 2 years ago

I don't have concrete examples of specific variants right now. Apologies, I am juggling trying to look at this and finish up my thesis at the same time. Once the thesis is done though I will go back and get some specific examples for you.