samtools / htslib

C library for high-throughput sequencing data formats

Fix cram_index_query_last function #1574

Closed · jkbonfield closed this 1 year ago

jkbonfield commented 1 year ago

The cram_index_query_last function, used by sam_itr_regarray, had problems when dealing with slices whose region is contained within another region.

Fundamentally this is because the cram_index arrays are contiguous in memory until a containment is found, at which point the pointers refer to an entirely different array. This breaks naive pointer comparisons.

The cram_index struct already had a "next" field holding the file offset of the next container. This has been replaced by an e_next pointer to the next cram_index entry in file order, with e_next->offset being equivalent to the old "next". This allows the index to be consumed either as the original nested containment list or as a traditional linked list.

Also fixed cram_index_query with from != NULL, which similarly was incorrect before. We never used this function and it's not public, but we now use it within the rewrite of cram_index_query_last.

Fixes #1569

Some testing on a local CRAM file, with nested containers (nested containment list in index) and some multi-ref containers present too.

                Real(s)   User(s)       No.Seqs
1.7:
BAM @8          0m12.972s 0m54.406s     1101651 WRONG
CRAM-1k @8      threaded fails with CRC err
" no threads    0m31.657s 0m28.115s     1100721 WRONG

1.9:
BAM @8          0m10.607s 0m33.546s     1101651 WRONG
CRAM-1k @8      1m13.605s 7m5.137s      17031   VERY WRONG!
" no threads    0m42.706s 0m40.298s     1100731 WRONG

1.10.2:
BAM @8          0m11.040s 0m34.387s     1101852 (first BAM to be correct)
CRAM-1k @8      1m38.255s 7m44.389s     1100922 WRONG
" no threads    0m28.594s 0m26.180s     1100922 WRONG

1.11:
CRAM-1k @8      0m24.957s 0m46.595s     1586311 VERY WRONG
" no threads    0m39.835s 0m34.317s     1586311 VERY WRONG

1.17:
CRAM-1k @8      0m26.114s 0m52.064s     1586311 VERY WRONG
" no threads    0m42.399s 0m36.827s     1586311 VERY WRONG

this PR:
BAM @8          0m9.148s  0m40.372s     1101852
" no threads    0m9.414s  0m9.201s      1101852
CRAM-1k @8      0m23.206s 0m43.078s     1101852
" no threads    0m25.381s 0m23.155s     1101852

It's surprising to see the earlier releases giving back so many more results from this batch of queries.

For reference for future code archaeologists, my test was:

time for r in `seq 1 10`; do
    echo -n "$r "
    samtools view -H 17537_1#4.bam | perl -lane '
        BEGIN { srand('$r') }
        next unless /^@SQ/;
        $F[1] =~ s/^SN://;
        $F[2] =~ s/^LN://;
        $l1 = int(rand($F[2]/4));
        $l2 = int(rand(10000));
        for ($i = rand() < 0.2 ? $l2 : $l1; $i < $F[2];
             $i += rand() < 0.7 ? $l2 : $l1) {
            $en = $i + int(rand(1000));
            $en = $F[2] if $en > $F[2];
            print "$F[1]:$i-$en";
            $l1 = int(rand($F[2]/10) + 50000);
            $l2 = int(rand(1000));
        }' > _reg
    ./test/test_view -M 17537_1#4.s1k.cram `cat _reg` | egrep -cv '^@'
done | awk '{a+=$2; print} END {print "total",a}'

The .s1k.cram was built with -O cram,seqs_per_slice=1000 to force more container junctions. I also tested 10k and 100 seqs per slice, plus 3 slices per container, which stressed the previous version even harder.
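For anyone reproducing this, a file of that shape can be produced with something along these lines (a CLI fragment with hypothetical input/output names; the -O format,option=value syntax is the standard samtools output-format option):

```shell
# Re-encode a BAM as CRAM with 1000 sequences per slice to force
# many more container/slice junctions in the index.
samtools view -O cram,seqs_per_slice=1000 -o out.s1k.cram in.bam
samtools index out.s1k.cram
```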

jkbonfield commented 1 year ago

See also https://github.com/samtools/htslib/pull/1064 which was a previous attempt to fix the few missing records comparing BAM to CRAM. This is what caused the big spike of extra records for 1.11 onwards, while apparently not fixing all missing ones.

That was despite very extensive testing. So please also do your own testing!

I'm trying to construct some synthetic stress tests, but so far everything works on them which has me scratching my head...

daviesrob commented 1 year ago

Oops, posted this on the wrong tab...

This looks OK so far, although enabling the dump_index() code to see what's going on shows some rather deep nesting on the unmapped reads:

-1 / -2147483648 .. 2147483647,                   offset 0 0x27755d0 (nil)
    -1 / 0 .. 149,                                offset 15726314885 0x27523b0 0x2752740
        -1 / 0 .. 149,                            offset 15727086877 0x2752740 0x2752ad0
            -1 / 0 .. 149,                        offset 15727730588 0x2752ad0 0x2752e60
                -1 / 0 .. 149,                    offset 15728372279 0x2752e60 0x27531f0
                    -1 / 0 .. 149,                offset 15729011232 0x27531f0 0x2753580
                        -1 / 0 .. 149,            offset 15729651684 0x2753580 0x2753910
                            -1 / 0 .. 149,        offset 15730292336 0x2753910 0x2753ca0
                                -1 / 0 .. 149,    offset 15730940888 0x2753ca0 0x2754030
                                    -1 / 0 .. 149, offset 15731618936 0x2754030 0x27543c0
                                        -1 / 0 .. 149, offset 15732293423 0x27543c0 0x2754750
                                            -1 / 0 .. 149, offset 15732964238 0x2754750 0x2754ae0
                                                -1 / 0 .. 149, offset 15733646463 0x2754ae0 0x2754e70
                                                    -1 / 0 .. 149, offset 15734330256 0x2754e70 0x2755200
                                                        -1 / 0 .. 149, offset 15734980719 0x2755200 0x2755590
                                                            -1 / 0 .. 149, offset 15735564531 0x2755590 0x2755920
                                                                -1 / 0 .. 149, offset 15736146386 0x2755920 0x2755cb0
                                                                    -1 / 0 .. 149, offset 15736729893 0x2755cb0 0x2756040
                                                                        -1 / 0 .. 149, offset 15737333674 0x2756040 0x1ee46e0

This isn't a problem in this case, and I guess it would be OK for cram_index_query() as you can only ask to go to the start of the unmapped data. But if you accidentally made a file with a lot of unmapped data I guess you could end up getting some deep recursion in link_index_(), not to mention a lot of allocated cram_index::e structs. Do you think this could be a problem, and is there anything that could easily be done about it?

jkbonfield commented 1 year ago

Corrected the index loading code to detect indices with invalid end coords for unmapped data. This avoids the deep nesting reported above.

I cannot work out what produced that broken index though. None of the tools I've tested (going back into ancient history) are doing so. There's no PG meta-data in the file either, so it's a mystery.