samtools / htslib

C library for high-throughput sequencing data formats
Other
809 stars 446 forks source link

Add CRAM SQ/M5 header checking when specifying a fasta file. #1522

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

See https://github.com/samtools/samtools/issues/1748 for background, although this isn't fixing that issue and I'm not yet sure what that problem is.

I'm unsure on whether this is a desireable solution. See the benchmarks below; the "fix" isn't free.


Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match.

This does however add a performance hit when processing the header, which can unfairly penalise small files or creation of CRAMs for a small portion of the genome.

# Old htslib 1.16 speed, enabled here by explicitly ignoring md5
$ time samtools view -T $HREF -O cram,ignore_md5 -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m0.807s
user    0m0.702s
sys     0m0.101s

# New performance, dominated by validating SQ M5s
$ time samtools view -T $HREF -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m20.312s
user    0m18.543s
sys     0m1.760s

# Using REF_PATH md5 cache instead is fast
$ time samtools view -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m0.222s
user    0m0.204s
sys     0m0.016s

This significant start-up cost is why this check wasn't previously here. We need to decide between validation of correctness and performance. It's optional now, but perhaps ignore_md5 isn't the correct option (it affects other things) and perhaps it should be opt-in instead of opt-out.

daviesrob commented 1 year ago

That is a bit of a penalty, although it may be possible to optimise the loading code a bit.

How difficult would it be to only check the references at the point when they're used in alignment records? That would improve the case above, as you'd only need to check one chromosome. Although it would also mean the error could turn up quite a long way into writing the file, which isn't as ideal as discovering the problem right at the start...

jkbonfield commented 1 year ago

Taking the idea suggested of doing the checks on the fly has worked well. Thanks.

On the same file above, converting BAM to CRAM:

ignore_md5=0 using -t $HREF -i reference=$HREF                                  
        10M-10M         10M-20M         10M-50M                                 
real    0m0.886s        0m2.207s        0m5.403s                                
user    0m0.791s        0m2.062s        0m5.199s                                
sys     0m0.092s        0m0.144s        0m0.201s                                

ignore_md5=1                                                                    
real    0m0.612s        0m1.929s        0m5.115s                                
user    0m0.510s        0m1.779s        0m4.899s                                
sys     0m0.100s        0m0.148s        0m0.213s                                

ignore_md5=0 using REF_CACHE (md5)                                              
real    0m0.024s        0m1.278s        0m4.190s                                
user    0m0.017s        0m1.244s        0m4.068s                                
sys     0m0.006s        0m0.032s        0m0.121s                                

Ignoring the obvious HUGE win here of using a preformatted raw text file found via md5 cache (and mmap for I/O), the difference between validating md5sum and not is now negligible.

We also see the same big speed gain when doing embed_ref=2 on tiny regions, so there's that as an alternative to validating external refs still:

ignore_md5=0 embed_ref=2                                                        
        10M-10M         10M-20M         10M-50M                                 
real    0m0.025s        0m1.692s        0m5.474s                                
user    0m0.014s        0m1.578s        0m5.330s                                
sys     0m0.010s        0m0.072s        0m0.140s                                

It seems to work on name sorted data too (where it also validates, but once only per ref).

daviesrob commented 1 year ago

The revised version looks OK so far, apart maybe from the advice printed when you use the wrong reference:

./test/test_view -C -t /tmp/ce.fa -p /tmp/ce#1.cram test/ce#1.tmp.cram.sam_ 
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references

Trying embed_ref=1 on its own doesn't work:

./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -p /tmp/ce#1.er1.cram test/ce#1.tmp.cram.sam_ 
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references

embed_ref=2 does though, and produces a working cram file, as also does no_ref=1; or embed_ref=1 when combined with ignore_md5=1:

./test/test_view -C -t /tmp/ce.fa -o embed_ref=2 -p /tmp/ce#1.er2.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o no_ref=1 -p /tmp/ce#1.nr1.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -o ignore_md5=1 -p /tmp/ce#1.er1.im1.cram test/ce#1.tmp.cram.sam_

Would it be safe to disable the check in embed_ref=1 mode?

jkbonfield commented 1 year ago

I decided to update the error message to suggest embed_ref=2. The problem with level 1 is that we're still stating that the external reference is the correct one, when we know there is a disparity between how the file was aligned and the reference on disk. The difference could be significant, including indels that cause every read to be off by 1.

Ignoring the md5sum and continuing could give us a very suboptimal cram, so it's best to be defensive here.