kiyo-masui / bitshuffle

Filter for improving compression of typed binary data.
Other
215 stars 76 forks source link

[arm64] use a better translation for move_mask #140

Open sebpop opened 1 year ago

sebpop commented 1 year ago

No changes | With the patch | Speedup $ python3 ./tests/test_ext.py | | .bitshuffle 64 : 4.94 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x .bitunshuffle 64 : 5.09 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x .compress 64 : 5.26 s/GB, 0.19 GB/s | 1.80 s/GB, 0.55 GB/s | 2.89x .compress zstd 64 : 8.02 s/GB, 0.12 GB/s | 4.80 s/GB, 0.21 GB/s | 1.75x .decompress 64 : 5.72 s/GB, 0.17 GB/s | 2.21 s/GB, 0.45 GB/s | 2.64x .decompress zstd 64 : 5.71 s/GB, 0.18 GB/s | 2.18 s/GB, 0.46 GB/s | 2.55x

stdpain commented 1 year ago

It seems it have some problem when apply this patch. here was a reproduce case: https://github.com/stdpain/bitshuffle/tree/reproduce/reproduce

stdpain commented 1 year ago

It looks like if you compile with -O0 you will get the correct result. Hope it can help you

sebpop commented 1 year ago

Hello @stdpain, thanks a lot for the testcase. I have tried to reproduce the error, however I am seeing the exact same output when I compile with and without my patch, on a Graviton3 c7g machine that is running Debian. I also have tried both gcc-9.3.0 and clang-11.0.1 and both compilers produce the same output.

$ clang --version
Debian clang version 11.0.1-2
Target: aarch64-unknown-linux-gnu

$ gcc --version
gcc (Debian 9.3.0-22) 9.3.0

Could you please report which gcc version you are using? I will try to see if I can see the error with your compiler version.

sebpop commented 1 year ago

I was able to reproduce the bug with gcc10 on AL2. The output is incorrect:

$ make
cd .. && gcc10-gcc src/bitshuffle_core.c src/bitshuffle.c src/iochain.c lz4/lz4.c -O3 -march=native -c -std=c99 -I lz4/ && ar rs libbitshuffle.a ./*.o && cd src
gcc10-g++ test.cpp -I ../src ../libbitshuffle.a
spop-arm64 - 03:35:04:~/bitshuffle/reproduce$ ./a.out 
16384
0
0
0
0
39
45
51
57
[...]

I would suggest using gcc7 from AL2 until I fix the issue in gcc10.

sebpop commented 1 year ago

gcc-9 is the last version of gcc that works. gcc-10 and all versions of gcc after 10 are broken. gcc-trunk as of today is still broken. I will fix the bug in gcc-trunk and do the back-ports to all active branches.

sebpop commented 1 year ago

I opened a bug against gcc-10: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109519

sebpop commented 1 year ago

The compiler does not have a bug. I added d29228f to this pull request to fix the issue in the code of bitshuffle, following the recommendations from Andrew Pinski.

stdpain commented 1 year ago

It works when compile with -fno-strict-aliasing

stdpain commented 1 year ago

patch d29228f also works.

stdpain commented 1 year ago

@sebpop It seems we could have a better implements for neonmovemask_bulk https://gist.github.com/geofflangdale/99393863c8cae3e83195a5e592e7dc82

uint8x16_t t0 = vbslq_u8(vdupq_n_u8(0x55), p0, p1); // 01010101...
uint8x16_t t1 = vbslq_u8(vdupq_n_u8(0x55), p2, p3); // 23232323...
uint8x16_t combined = vbslq_u8(vdupq_n_u8(0x33), t0, t1); // 01230123...
int8x8_t sum = vshrn_n_s16(vreinterpretq_s16_u8(combined), 4);
return vget_lane_u64(vreinterpret_u64_s8(sum), 0);
stdpain commented 1 year ago

@sebpop It seems we could have a better implements for neonmovemask_bulk https://gist.github.com/geofflangdale/99393863c8cae3e83195a5e592e7dc82 uint8x16_t t0 = vbslq_u8(vdupq_n_u8(0x55), p0, p1); // 01010101... uint8x16_t t1 = vbslq_u8(vdupq_n_u8(0x55), p2, p3); // 23232323... uint8x16_t combined = vbslq_u8(vdupq_n_u8(0x33), t0, t1); // 01230123... int8x8_t sum = vshrn_n_s16(vreinterpretq_s16_u8(combined), 4); return vget_lane_u64(vreinterpret_u64_s8(sum), 0);

ignore

sebpop commented 1 year ago

Ping patch.