Slightly speed up various cram decoding functions

jkbonfield commented 1 year ago

None of this is huge, but it all adds up.

bam_set1 has been refactored so -O3 is more likely to do unrolling and vectorisation.

    // Old          time   inst        cyc
    // gcc -O2      12.36  78936832183 36853852204
    // gcc -O3      12.37  78713347525 36867027825
    // clang13 -O2  12.43  77451926728 37012866717
    // clang13 -O3  12.32  77627221907 36691623424
    // gcc12 -O2    12.43  78895089091 37081260172
    // gcc12 -O3    12.36  78505904437 36829216967

    // New
    // gcc -O2      12.47  78832021505 37200597109 +
    // gcc -O3      12.14  76499369401 36390334338 --
    // clang13 -O2  12.38  76678460761 36920111561 ~
    // clang13 -O3  12.26  76678023071 36548488492 ~
    // gcc12 -O2    12.38  78581694397 36880034181 -
    // gcc12 -O3    12.15  76356625541 36293921439 --
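
As a rough illustration of the kind of loop shape that helps -O3 here, the sketch below packs 4-bit codes two per byte with a branch-free main loop and handles the odd trailing element separately. It is not the actual bam_set1 code; `pack_nibbles` and its parameters are hypothetical names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT htslib's bam_set1: packs n 4-bit codes
 * (one per input byte) into out, two codes per output byte.
 * out must have room for (n + 1) / 2 bytes. */
static void pack_nibbles(const uint8_t *codes, size_t n, uint8_t *out) {
    size_t pairs = n / 2;

    /* Hot loop: trip count known up front and no branches in the body,
     * which gives the optimiser a chance to unroll or vectorise it. */
    for (size_t i = 0; i < pairs; i++)
        out[i] = (uint8_t)(((codes[2 * i] & 0x0f) << 4) |
                           (codes[2 * i + 1] & 0x0f));

    /* Odd trailing code is handled once, outside the hot loop. */
    if (n & 1)
        out[pairs] = (uint8_t)((codes[n - 1] & 0x0f) << 4);
}
```

Keeping the odd-length case out of the loop body is the design choice that matters: a per-iteration bounds check is the sort of thing that can stop a compiler from vectorising.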

Improve the MD/NM generation in CRAM decoding. With decode_md=1 (the default) the decode time changed from 12.91s to 12.57s. With decode_md=0 it's 11.92s, so that's about 1/3rd of the MD/NM overhead removed (0.34s of 0.99s).
Changed the block_resize to resize in slightly smaller chunks and to use integer maths.
Reduce excessive pointer indirection in cram_decode_seq.
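
The pointer change in cram_decode_seq can be pictured with a minimal sketch like the one below: a nested pointer chain is loaded once into locals instead of being re-walked on every loop iteration. The struct and function names (`ref_t`, `slice_t`, `decoder_t`, `copy_ref_slow`, `copy_ref_fast`) are hypothetical stand-ins and do not match htslib's real types.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nested decoder state; NOT htslib's real structs. */
typedef struct { const uint8_t *data; size_t len; } ref_t;
typedef struct { ref_t *ref; } slice_t;
typedef struct { slice_t *slice; } decoder_t;

/* Before: each iteration walks d->slice->ref again. Since out is a
 * byte pointer that could alias the structs, the compiler may have
 * to reload the whole chain every time round the loop. */
static void copy_ref_slow(const decoder_t *d, uint8_t *out) {
    for (size_t i = 0; i < d->slice->ref->len; i++)
        out[i] = d->slice->ref->data[i];
}

/* After: follow the chain once, then work from plain locals. */
static void copy_ref_fast(const decoder_t *d, uint8_t *out) {
    const uint8_t *src = d->slice->ref->data;
    size_t len = d->slice->ref->len;
    for (size_t i = 0; i < len; i++)
        out[i] = src[i];
}
```

Both functions produce the same output; the second is also shorter to read, which fits the "tidier code" aspect.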

Unsure if this speeds things up much (sometimes it seems to), but it provides tidier code too.
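
Going back to the block_resize change, a sketch of what growing a buffer with integer maths might look like is below. The growth factor and the constant are illustrative assumptions, not the values used in the PR, and `grow_capacity` is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical growth policy, NOT htslib's block_resize. Grows alloc
 * until it exceeds needed. The 1.25x factor and the +800 constant are
 * assumptions chosen for illustration only. */
static size_t grow_capacity(size_t alloc, size_t needed) {
    while (alloc <= needed) {
        /* Integer form of alloc * 1.25 + 800: shift and add only,
         * no round trip through floating point. Overflow checks are
         * omitted in this sketch. */
        alloc = alloc + (alloc >> 2) + 800;
    }
    return alloc;
}
```

For example, under these assumed constants `grow_capacity(0, 10)` gives 800 and `grow_capacity(800, 800)` gives 1800.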

Combined before and after on 10 million NovaSeq CRAM (v3.1)

EPYC 7543

               before   after
gcc(7)  -O2    7.67     7.63   -0.5%
gcc12   -O2    7.59     7.60   +0.1%
clang7  -O2    8.12     7.57   -6.8%
clang13 -O2    8.06     7.54   -6.5%

gcc(7)  -O3    7.73     7.46   -3.5%
gcc12   -O3    7.46     7.35   -1.5%
clang7  -O3    8.08     7.57   -6.3%
clang13 -O3    7.95     7.66   -3.6%

Xeon Gold 6142

               before   after
gcc(7)  -O2    9.74     9.14   -6.2%
gcc12   -O2    9.43     8.45  -10.4%
clang7  -O2    9.61     8.64  -10.0%
clang13 -O2    9.95     8.85  -11.1%

gcc(7)  -O3    9.51     8.81   -7.4%
gcc12   -O3    9.15     8.42   -8.0%
clang7  -O3    9.92     8.72  -12.1%
clang13 -O3    9.68     8.91   -8.0%

The biggest change is with clang, and we also see bigger changes on Intel than on AMD.
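
The percentage column in the tables above appears to be the relative change in time, (after - before) / before, which matches the rows when recomputed from the rounded times to within 0.1 of a percentage point. A small sketch of that arithmetic, with `pct_change` as a hypothetical helper name:

```c
/* Relative change from before to after, in percent.
 * Negative means the "after" run was faster. */
static double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}
```

For the Xeon Gold 6142 clang13 -O2 row, `pct_change(9.95, 8.85)` is about -11.06, which rounds to the -11.1% shown.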

jkbonfield commented 1 year ago

Extra data for other data sets (including duplicating the NovaSeq data from above). I stuck with clang13 -O2 and one CPU rather than testing everything, as that combination seemed both realistic and showed a considerable benefit. Pleasing to see it applies well on other data too.

Xeon Gold 6142, clang13 -O2, different data sets

           before    after
novaseq      9.95     8.85  -11.1%
revio       23.33    19.46  -16.6%
ultima     195.20   177.71   -9.0%
ONT         68.27    60.67  -11.1%

jkbonfield commented 1 year ago

Working on fixing it! Turns out my trivial 2 line SAM file for testing wasn't exactly enough. :/

samtools / htslib

Slightly speed up various cram decoding functions #1580