kiyo-masui / bitshuffle

Filter for improving compression of typed binary data.

Consolidate SSE optimized shuffle with Blosc. #11

Open kiyo-masui opened 10 years ago

kiyo-masui commented 10 years ago

Both Bitshuffle and Blosc (Blosc/c-blosc@b37ca0bf820f104debde78f6267e5e7ea69e6c54) implement optimized (SSE2) versions of shuffle for 16-, 32-, and 64-bit element sizes. In Bitshuffle these routines are bshuf_trans_byte_elem_*. The operation counts for the two implementations appear to be the same, but we should check which versions are fastest and consolidate them.

Bitshuffle also has optimized code for the case where the element size is a multiple of 32 or 64 bits, which is useful for compound data types and could benefit Blosc.
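For readers unfamiliar with the routine being discussed: the byte-level element transpose that both libraries vectorize can be stated in a few lines of numpy. This is a naive scalar reference, not the SSE2 code, and the helper names are my own:

```python
import numpy as np

def trans_byte_elem(buf: bytes, elem_size: int) -> bytes:
    # Naive reference for the byte-level element transpose (the scalar
    # equivalent of bshuf_trans_byte_elem_*): view the buffer as
    # (n_elements, elem_size) bytes and transpose, so byte 0 of every
    # element comes first, then byte 1, and so on. Works for any
    # elem_size, including multiples of 4 or 8 bytes.
    a = np.frombuffer(buf, dtype=np.uint8).reshape(-1, elem_size)
    return a.T.tobytes()

def untrans_byte_elem(buf: bytes, elem_size: int) -> bytes:
    # Inverse: view as (elem_size, n_elements) and transpose back.
    a = np.frombuffer(buf, dtype=np.uint8).reshape(elem_size, -1)
    return a.T.tobytes()
```

Round-tripping a uint32 array through `trans_byte_elem`/`untrans_byte_elem` with `elem_size=4` recovers the original bytes, and the first quarter of the output is exactly the byte-0 plane of the input.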

kif commented 4 years ago

To expand on this: bitunshuffle uses an extra temporary buffer between the byte transpose and the bitwise transpose. I believe this drastically reduces performance as the buffer size increases: https://github.com/kiyo-masui/bitshuffle/blob/master/src/bitshuffle_core.c#L1297

I wrote an alternative implementation which uses a limited number of vectors (8, 16, 32, or 64) for temporary storage. This makes best use of the L1 cache instead of going to the L3 cache, which is usually shared between the cores.
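The two-stage structure being discussed can be modeled in numpy, with the intermediate buffer made explicit. This is a rough sketch of the decomposition (byte transpose, temporary buffer, then a bit transpose per byte plane), not byte-for-byte the library's output ordering, and the function names are mine:

```python
import numpy as np

def bit_transpose(plane):
    # Transpose the (n_bytes x 8) bit matrix of one byte plane to (8 x n_bytes).
    bits = np.unpackbits(plane).reshape(-1, 8)
    return np.packbits(bits.T)

def bitshuffle_ref(buf, elem_size):
    # Stage 1: byte transpose -- one plane per byte position of the element.
    planes = np.frombuffer(buf, np.uint8).reshape(-1, elem_size).T
    # This copy stands in for the temporary buffer that sits between the
    # byte transpose and the bitwise transpose in bitshuffle_core.c.
    tmp = np.ascontiguousarray(planes)
    # Stage 2: bit transpose of each plane.
    return b"".join(bit_transpose(p).tobytes() for p in tmp)

def bitunshuffle_ref(buf, elem_size):
    n = len(buf) // elem_size  # bytes per plane
    planes = [np.frombuffer(buf[i * n:(i + 1) * n], np.uint8)
              for i in range(elem_size)]
    # Invert stage 2: the bit matrix is now 8 x n; transpose back to n x 8.
    tmp = [np.packbits(np.unpackbits(p).reshape(8, -1).T) for p in planes]
    # Invert stage 1: interleave the byte planes back into elements.
    return np.stack(tmp, axis=1).tobytes()
```

The point under discussion is whether stage 2 can consume stage 1's output a few vector registers at a time instead of materializing `tmp` in full.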

kiyo-masui commented 4 years ago

Keep in mind that the whole operation is blocked into small blocks of 8 kB, which do fit in cache.
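The blocking scheme kiyo-masui describes amounts to the following sketch, where the 8192-byte default is taken from the comment above and `transform` is a placeholder for any shuffle stage:

```python
def process_blocked(buf: bytes, transform, block_size: int = 8192) -> bytes:
    # Apply `transform` independently to consecutive blocks of at most
    # `block_size` bytes, so each block's working set stays in L1 cache.
    # `transform` is any length-preserving bytes -> bytes function,
    # e.g. one of the shuffle stages sketched earlier.
    out = []
    for start in range(0, len(buf), block_size):
        out.append(transform(buf[start:start + block_size]))
    return b"".join(out)
```

Because each block is transformed independently, an involution applied twice per block recovers the input, including a short final block.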

kif commented 4 years ago

On Fri, 08 Nov 2019 05:55:13 -0800 Kiyoshi Masui notifications@github.com wrote:

> Keep in mind that the whole operation is blocked in small blocks of 8k bytes, which does fit in cache.

I understand that this 8 kB limit is indeed imposed by the L1 cache size (~32 kB / 3).

My benchmarks may be biased, but with larger chunk sizes (~L3/3) performance apparently reaches a plateau without the drop observed in some cases: https://user-images.githubusercontent.com/1018880/68398473-15658e80-0175-11ea-9225-257448de9104.png and slide 18 of https://www.hdfgroup.org/wp-content/uploads/2019/09/2019-09-20-Power9HDF5.pdf.

I have the feeling the SSE2 and AVX implementations could also benefit from this approach.

Bitshuffling may be negligible compared to the cost of LZ4 on x86 hardware, but the POWER9 has hardware compression for gzip, and shuffling bits is then clearly the limiting factor.

kiyo-masui commented 4 years ago

Looking at your slides, you are testing the version of bitshuffle that ships with blosc? Can you post some test code?

kif commented 4 years ago

> Looking at your slides, you are testing the version of bitshuffle that ships with blosc?

Yes, blosc2 beta to be precise.

> Can you post some test code?

For the Blosc code, I have to manually expose the different functions I want to test and call them with ctypes. For your package this is not needed, as it is directly exposed in Python via Cython. In both cases the overhead due to Python remains an unknown; I only know it is much higher on ppc64le than on x86 (probably 2x).

I had a look at the two SSE2 implementations of bitunshuffle, and to me they differ only in a malloc for the temporary buffer.

Here is the benchmarking code I used, taking advantage of the Jupyter kernel's %timeit magic:

import bitshuffle
import numpy

def benchmark(mini=10, maxi=25, dtype="uint8"):
    # Sweep buffer sizes from 2**mini to 2**(maxi-1) bytes and report
    # shuffle/unshuffle throughput in GB/s (best of several runs).
    res = {}
    for i in range(mini, maxi):
        size = 1 << i
        data = numpy.random.randint(0, 255, size=size).astype(dtype)
        # %timeit is an IPython magic, so this must run in a Jupyter/IPython kernel.
        bs = %timeit -o bitshuffle.bitshuffle(data, data.nbytes)
        bu = %timeit -o bitshuffle.bitunshuffle(data, data.nbytes)
        res[size] = (data.nbytes / bs.best / 1e9, data.nbytes / bu.best / 1e9)
    return res
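For reference, the same sweep can be written with the stdlib timeit module so it also runs outside a Jupyter kernel. The shuffle/unshuffle callables are passed in as parameters; they are placeholders for e.g. bitshuffle.bitshuffle / bitshuffle.bitunshuffle or a ctypes wrapper around the Blosc routines:

```python
import timeit
import numpy

def benchmark_plain(shuffle, unshuffle, mini=10, maxi=25,
                    dtype="uint8", repeat=5):
    # Same size sweep as the cell above, but using timeit.repeat directly
    # instead of the %timeit magic. `shuffle` and `unshuffle` are any
    # array -> array callables supplied by the caller.
    res = {}
    for i in range(mini, maxi):
        size = 1 << i
        data = numpy.random.randint(0, 255, size=size).astype(dtype)
        bs = min(timeit.repeat(lambda: shuffle(data), number=1, repeat=repeat))
        bu = min(timeit.repeat(lambda: unshuffle(data), number=1, repeat=repeat))
        # Throughput in GB/s, using the best (minimum) timing.
        res[size] = (data.nbytes / bs / 1e9, data.nbytes / bu / 1e9)
    return res
```

This keeps the Python-overhead caveat above, but removes the dependency on IPython, which may help when comparing x86 and ppc64le environments.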