phoboslab / qoi

The “Quite OK Image Format” for fast, lossless image compression
MIT License
6.85k stars 327 forks source link

faster hash calculations nad other optimizations. #203

Closed kpckpc closed 1 year ago

kpckpc commented 2 years ago

Edit: I made a mistake. There is a 0.03% change of failing. An additional carry can corrupt the hash. This combination of colors apparently never happened in my testset.

Not reporting an issue, but wanted to drop some SWAR magic, to calculate the hash faster. Please note this is obeying the current hash function. This is not a proposal for a new hash function. Also probably not to include in the reference implementation, but might be used for people generating optimized versions.

#include <stdint.h>
#include <endian.h>
#if __BYTE_ORDER == __LITTLE_ENDIAN
  #define QOI_COLOR_HASH(C) \
    ((uint32_t) ((C.v & 0x3f3f3f3ful) * 0xc141c2cul - ((C.v & 0x200000ul) << 5)) >> 26)
#else
  #define QOI_COLOR_HASH(C) \
    ((uint32_t) ((C.v & 0x3f3f3f3ful) * 0x2c1c140cul) >> 26)
#endif

The magic constant for little endian being: (11|(7<<8)|(5<<16)|(3<<24))<<2, for big endian, the factors reversed; All byte products end up in the msb. The additional shift of 2 cause the two high bits to drop off (i.e. a free modulo 64) For little endian, there is 1 bit of spillover from the blue channel that needs to be corrected, hence the subtraction.

The generated assembly for the hash on x86_64 (the compiler first does the shift and then the masking)

        and     eax, 1061109567
        imul    eax, eax, 202644524
        shl     edi, 3            \
        and     edi, 1621156192    } 3 additional instructions to correct for spillover
        sub     eax, edi          /
        shr     eax, 26

I have tried all kinds of optimizations, to speed it up, including SSE/AVX tricks, to detect rows of rgb(a) opcodes, and to perform hash calculations in parallel, but gcc -Ofast -march=native still outsmarts me with a pretty clean gcc code with the following changes. the following resulted in a roughly 180% increase:

As said, just wanted to drop this. Definitely not intended for inclusion in the reference version. This should remain readable.