samtools / htslib

C library for high-throughput sequencing data formats
Other
784 stars 447 forks source link

Enable optimisation level -O3 for SAM QUAL+33 formatting. #1679

Closed jkbonfield closed 8 months ago

jkbonfield commented 9 months ago

On long read data, the time to format SAM files is dominated by sequence and quality.

The qual[i]+33 loop to turn binary quals into printable ASCII is not vectorised by GCC without using -O3. I would consider this a weakness of the compiler, but nothing I've done has persuaded gcc (before v12) to generate vector instructions. Not even the "restrict" keyword.

Hence using attribute((optimize("O3"))).

The time the new add33 function is approx 15x quicker with gcc -O3 than gcc -O2. Clang's and icc's default optimisation level gives speeds comparable to the gcc -O3.

With a compressed Illumina BAM this was just 3% overall speed gain to decode to SAM. The extreme opposite is uncompressed ONT BAM which shows a 23% speed gain.

jkbonfield commented 9 months ago

Note my original implementation of this had a prototype of

static inline void HTS_OPT3 add33(uint8_t *restrict a, const uint8_t *restrict b, int32_t len)        

However our Rocky Linux CI test explicitly turns off C99 support via --std=gnu90, hence restrict is not permitted. As far as I'm aware we target C99 (it is 20+ year old after all). What's the reason behind explicitly forbidding C99 extensions on Rocky Linux? I know it's an old compiler, but sure not over 20 year old!

That said, it makes no difference here. The use of restrict was a failed attempt to get gcc to behave, but it was resolute in wanting to avoid all forms of vectorisation without explicitly enabling them via -ftree-loop-vectorize (or -O3 which adds that).

rhpvorderman commented 9 months ago

Wow, this is very instructive. Goodbye #ifdef __SSE2__! Thanks for showing me this!

daviesrob commented 9 months ago

The -std=gnu90 test was added in #1524 to catch cases of for (int n = 0, ...), which is not supported by gnu90. This (or gnu89) was the default on the GCC-4 series, which was shipped in Red Hat Enterprise Linux versions 5 to 7. RHEL7 is still (just) in production, and will not be completely retired until June 2026. It's also possible that some BSDs ship ancient GCCs, although they seem to be migrating to clang.

It's possible we could allow more c99-only features with suitable configure tests to set the -std=c99 option, but we'd have to check that the code still worked on RedHat, at least.

We could also add a configure test to set the optimisation to -O3, instead of just doing it for one very small piece of code...

jkbonfield commented 9 months ago

I don't understand though why we're explicitly attempting to forbid the very useful "for (int i = ..." notation. It's standard in C99.

I get that the earlier RedHat's don't default to supporting this, but they do still support it: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Standards.html#Standards

All it requires is that people build with CC="gcc --std=gnu99" and it'd be accepted. I don't think this is an onerous problem, and it could be part of autoconf anyway. I'm happy with -O3 being the default and I'm sure I've suggested this in the past (but at the time wasn't able to be sufficiently convincing apparently).

I will say however that I think gcc are in error here. O3 is explicitly saying we favour speed over everything else, and aggressive unrolling of loops etc that significantly increases code size is worth it even if it's only a small speed gain. Basically it's the "turn it up to 11" level of optimisation. The vectorisation of this trivial loop is in a totally different class. It's an order of magnitude faster! It's not some minor speed gain vs big code size tradeoff unless you make the assumption that it's only executed with a very low number of cycles (and gcc has no way to hint at that, unlike several other compilers). Everyone else seems in agreement that vectorisation is a good thing to do even at earlier optimisation levels (such as the O2 offered by default from autoconf).