Closed jkbonfield closed 1 year ago
That is a bit of a penalty, although it may be possible to optimise the loading code a bit.
How difficult would it be to only check the references at the point when they're used in alignment records? That would improve the case above, as you'd only need to check one chromosome. Although it would also mean the error could turn up quite a long way into writing the file, which isn't as ideal as discovering the problem right at the start...
Taking the idea suggested of doing the checks on the fly has worked well. Thanks.
On the same file above, converting BAM to CRAM:
ignore_md5=0 using -t $HREF -i reference=$HREF
10M-10M 10M-20M 10M-50M
real 0m0.886s 0m2.207s 0m5.403s
user 0m0.791s 0m2.062s 0m5.199s
sys 0m0.092s 0m0.144s 0m0.201s
ignore_md5=1
real 0m0.612s 0m1.929s 0m5.115s
user 0m0.510s 0m1.779s 0m4.899s
sys 0m0.100s 0m0.148s 0m0.213s
ignore_md5=0 using REF_CACHE (md5)
real 0m0.024s 0m1.278s 0m4.190s
user 0m0.017s 0m1.244s 0m4.068s
sys 0m0.006s 0m0.032s 0m0.121s
Ignoring the obvious HUGE win here of using a preformatted raw text file found via md5 cache (and mmap for I/O), the difference between validating md5sum and not is now negligible.
We also see the same big speed gain when doing embed_ref=2 on tiny regions, so there's that as an alternative to validating external refs still:
ignore_md5=0 embed_ref=2
10M-10M 10M-20M 10M-50M
real 0m0.025s 0m1.692s 0m5.474s
user 0m0.014s 0m1.578s 0m5.330s
sys 0m0.010s 0m0.072s 0m0.140s
It seems to work on name sorted data too (where it also validates, but once only per ref).
The revised version looks OK so far, apart maybe from the advice printed when you use the wrong reference:
./test/test_view -C -t /tmp/ce.fa -p /tmp/ce#1.cram test/ce#1.tmp.cram.sam_
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references
Trying embed_ref=1
on its own doesn't work:
./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -p /tmp/ce#1.er1.cram test/ce#1.tmp.cram.sam_
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references
embed_ref=2
does though, and produces a working cram file, as also does no_ref=1
; or embed_ref=1
when combined with ignore_md5=1
:
./test/test_view -C -t /tmp/ce.fa -o embed_ref=2 -p /tmp/ce#1.er2.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o no_ref=1 -p /tmp/ce#1.nr1.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -o ignore_md5=1 -p /tmp/ce#1.er1.im1.cram test/ce#1.tmp.cram.sam_
Would it be safe to disable the check in embed_ref=1
mode?
I decided to update the error message to suggest embed_ref=2
. The problem with level 1 is that we're still stating that the external reference is the correct one, when we know there is a disparity between how the file was aligned and the reference on disk. The difference could be significant, including indels that cause every read to be off by 1.
Ignoring the md5sum and continuing could give us a very suboptimal cram, so it's best to be defensive here.
See https://github.com/samtools/samtools/issues/1748 for background, although this isn't fixing that issue and I'm not yet sure what that problem is.
I'm unsure on whether this is a desireable solution. See the benchmarks below; the "fix" isn't free.
Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match.
This does however add a performance hit when processing the header, which can unfairly penalise small files or creation of CRAMs for a small portion of the genome.
This significant start-up cost is why this check wasn't previously here. We need to decide between validation of correctness and performance. It's optional now, but perhaps ignore_md5 isn't the correct option (it affects other things) and perhaps it should be opt-in instead of opt-out.