samtools / htslib

C library for high-throughput sequencing data formats

Speed up load_ref_portion. #1551

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

This function is about 7x faster than before, which speeds up low-depth CRAM decoding by around 10% or so. The time spent in this function is constant regardless of depth, so the deeper the data, the less its speed matters.

The two main improvements are:

  1. Drop toupper_c and replace it with "c & ~0x20". This works for ASCII letters, and we already have far too many char lookup tables (e.g. converting ACGT to 0123) to ever work on mythical EBCDIC systems anyway.

  2. Remove the continuous white-space check. We exploit the knowledge that the FASTA format must have white-space only at the end of lines. The fai index can't work if this isn't true and I've already tested that samtools faidx fails to query correctly if we have whitespace elsewhere.
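The two ideas above can be sketched as follows. This is a minimal illustration, not the actual load_ref_portion code: the names ascii_upper and copy_ref_range are hypothetical, and the layout parameters assume an fai-style index where every line of a sequence (except possibly the last) holds the same number of bases followed by a fixed-size line terminator.

```c
#include <stddef.h>

/* Upper-case an ASCII letter by clearing bit 0x20.
 * Assumption: the input is a letter; other bytes would be mangled
 * (for example '*' becomes a newline), so this only suits sequence data. */
char ascii_upper(char c)
{
    return (char)(c & ~0x20);
}

/* Hypothetical helper: copy bases [start, end) (0-based) of one sequence
 * into out, upper-casing as we go.
 *   seq            points at the first base of the sequence in the file data
 *   bases_per_line bases on each full line
 *   line_length    bytes per line including the line terminator
 * Because white-space is assumed to occur only at line ends, we can jump
 * straight to each line and copy a whole run with no per-character
 * white-space test. Returns the number of bases written. */
size_t copy_ref_range(const char *seq, size_t bases_per_line,
                      size_t line_length, size_t start, size_t end,
                      char *out)
{
    size_t n = 0;
    size_t pos = start;

    while (pos < end) {
        size_t line = pos / bases_per_line;   /* which line holds pos    */
        size_t col  = pos % bases_per_line;   /* offset within that line */
        size_t run  = bases_per_line - col;   /* bases left on this line */
        if (run > end - pos)
            run = end - pos;

        const char *src = seq + line * line_length + col;
        for (size_t i = 0; i < run; i++)
            out[n++] = ascii_upper(src[i]);

        pos += run;
    }
    return n;
}
```

For example, with the data "acgt\nnnAC\ngt\n" (4 bases per line, 5 bytes per line), requesting bases 2 to 9 should yield the 7 bases GTNNACG. The inner loop has no branches beyond the loop condition, which is the kind of loop a compiler has a good chance of vectorising.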

Some benchmarks are below. I can't explain why mmap is being slow on this system (seq4c). It's not what I've observed before, where mmap is normally the fastest way to load the reference.

10 million records from 9827_1#49 at a mean depth of ~1.75x; cram_io.c was built with clang10 (although the rest was probably system gcc).

Reference via mmap is:

./test/test_view -B /tmp/_.cram
       58752578630      cycles                    #    2.984 GHz
      115846154170      instructions              #    1.97  insn per cycle
  20.41%         16536  test_view        [.] rans_uncompress_O1
> 10.55%          8560  [kernel]         [k] _etext
   8.06%          6532  test_view        [.] cram_decode_slice
   7.85%          6356  test_view        [.] RansDecRenorm2
   5.98%          4895  test_view        [.] body
   5.70%          4622  test_view        [.] cram_decode_seq.isra.11
   4.71%          3860  libz.so.1.2.11   [.] crc32_z
   4.54%          3744  libz.so.1.2.11   [.] inflateBackEnd
   4.08%          3289  test_view        [.] bam_set1

Not sure what "_etext" is, but it's a significant CPU portion.

Develop branch:

./test/test_view -i reference=$HREF -B /tmp/_.cram
       58397677857      cycles                    #    2.982 GHz
      122583155370      instructions              #    2.10  insn per cycle
  20.37%         16483  test_view        [.] rans_uncompress_O1
>  8.03%          6482  test_view        [.] load_ref_portion
   7.99%          6489  test_view        [.] cram_decode_slice
   7.66%          6201  test_view        [.] RansDecRenorm2
   5.77%          4715  test_view        [.] body
   5.71%          4638  test_view        [.] cram_decode_seq.isra.11
   5.40%          4381  libz.so.1.2.11   [.] crc32_z
   4.28%          3454  test_view        [.] bam_set1
   3.62%          2943  test_view        [.] cram_external_decode_block
   3.57%          2901  libz.so.1.2.11   [.] inflateBackEnd
   3.24%          2634  test_view        [.] cram_byte_array_stop_decode_block
   3.21%          2606  libc-2.27.so     [.] __memmove_sse2_unaligned_erms
   2.78%          2257  test_view        [.] cram_external_decode_int
   2.48%          2016  test_view        [.] cram_byte_array_len_decode
   2.10%          1705  test_view        [.] safe_itf8_get
   2.00%          1618  test_view        [.] rans_uncompress_O0
>  1.72%          1405  [kernel]         [k] _etext

"_etext" plummets, so it's something related to the mmap, but it's been replaced by a heavy load_ref_portion instead.

Old dev loop, but using &~0x20 instead of toupper_c:

./test/test_view -i reference=$HREF -B /tmp/_.cram
       54701579549      cycles                    #    2.983 GHz
      119136523274      instructions              #    2.18  insn per cycle
  21.35%         16066  test_view        [.] rans_uncompress_O1
   8.83%          6669  test_view        [.] cram_decode_slice
   8.50%          6395  test_view        [.] RansDecRenorm2
   6.23%          4749  test_view        [.] body
   6.23%          4708  test_view        [.] cram_decode_seq.isra.11
   5.82%          4391  libz.so.1.2.11   [.] crc32_z
   4.44%          3329  test_view        [.] bam_set1
   3.86%          2917  test_view        [.] cram_external_decode_block
   3.83%          2892  libz.so.1.2.11   [.] inflateBackEnd
>  3.59%          2699  test_view        [.] load_ref_portion
   3.34%          2528  test_view        [.] cram_byte_array_stop_decode_block
   3.33%          2511  libc-2.27.so     [.] __memmove_sse2_unaligned_erms
   2.98%          2250  test_view        [.] cram_external_decode_int
   2.67%          2015  test_view        [.] cram_byte_array_len_decode
   2.30%          1736  test_view        [.] safe_itf8_get
   2.16%          1620  test_view        [.] rans_uncompress_O0
>  1.87%          1432  [kernel]         [k] _etext

load_ref_portion dropped from 6482 to 2699.

New loop construction (this PR):

./test/test_view -i reference=$HREF -B /tmp/_.cram
       53294682517      cycles                    #    2.982 GHz
      114450133099      instructions              #    2.15  insn per cycle
  22.21%         16428  test_view        [.] rans_uncompress_O1
   8.66%          6432  test_view        [.] cram_decode_slice
   8.45%          6248  test_view        [.] RansDecRenorm2
   6.54%          4862  test_view        [.] cram_decode_seq.isra.11
   6.34%          4750  test_view        [.] body
   5.95%          4412  libz.so.1.2.11   [.] crc32_z
   4.53%          3343  test_view        [.] bam_set1
   3.99%          2968  test_view        [.] cram_external_decode_block
   3.95%          2928  libz.so.1.2.11   [.] inflateBackEnd
   3.54%          2635  test_view        [.] cram_byte_array_stop_decode_block
   3.33%          2465  libc-2.27.so     [.] __memmove_sse2_unaligned_erms
   2.96%          2201  test_view        [.] cram_external_decode_int
   2.80%          2079  test_view        [.] cram_byte_array_len_decode
   2.40%          1786  test_view        [.] safe_itf8_get
   2.22%          1643  test_view        [.] rans_uncompress_O0
   1.84%          1363  libc-2.27.so     [.] __memset_sse2_unaligned_erms
>  1.77%          1328  [kernel]         [k] _etext
>  1.29%           951  test_view        [.] load_ref_portion

load_ref_portion dropped again from 2699 to 951.
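
As a rough cross-check of the headline figures, assuming perf sample counts are proportional to the time spent in the function, the ratios from the listings above work out as:

$$
\begin{aligned}
\text{toupper\_c replaced:} \quad & 6482 / 2699 \approx 2.4\times \\
\text{new loop construction:} \quad & 2699 / 951 \approx 2.8\times \\
\text{overall:} \quad & 6482 / 951 \approx 6.8\times \\
\text{total cycles (develop vs this PR):} \quad & \frac{58397677857 - 53294682517}{58397677857} \approx 8.7\%
\end{aligned}
$$

These are single-run numbers, so they are indicative only, but they are consistent with "about 7x" for the function and roughly 10% for the whole decode.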