tukaani-project / xz

XZ Utils
https://tukaani.org/xz/
Other
532 stars 95 forks source link

Performance improvements on ARM64 #75

Closed parheliamm closed 8 months ago

parheliamm commented 8 months ago

Pull request checklist

Please check if your PR fulfills the following requirements:

Pull request type

Please check the type of change your PR introduces: - [ ] Bugfix - [ ] Feature - [ ] Code style update (formatting, renaming, typo fix) - [ ] Refactoring (no functional changes, no api changes) - [x] Build related changes - [ ] Documentation content changes - [x] Other (please describe): ## What is the current behavior?

Related Issue URL:

What is the new behavior?

Does this introduce a breaking change?

Other information

Performance improvements on ARM64 with 2 commits:

vanilla:
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   5102 MB/s  5183 MB/s   211957760 100.00 silesia.tar
xz 5.2.5 -0              15.9 MB/s  51.6 MB/s    62579868  29.52 silesia.tar
xz 5.2.5 -1              11.8 MB/s  57.5 MB/s    58408297  27.56 silesia.tar
xz 5.2.5 -2              7.53 MB/s  59.9 MB/s    56708167  26.75 silesia.tar
xz 5.2.5 -3              5.02 MB/s  61.6 MB/s    55745576  26.30 silesia.tar
xz 5.2.5 -4              3.15 MB/s  62.7 MB/s    52106950  24.58 silesia.tar
xz 5.2.5 -5              2.29 MB/s  65.1 MB/s    49960648  23.57 silesia.tar
xz 5.2.5 -6              1.94 MB/s  65.5 MB/s    49196155  23.21 silesia.tar
xz 5.2.5 -7              1.83 MB/s  65.7 MB/s    48926731  23.08 silesia.tar
xz 5.2.5 -8              1.87 MB/s  66.4 MB/s    48768992  23.01 silesia.tar
xz 5.2.5 -9              1.85 MB/s  66.5 MB/s    48747544  23.00 silesia.tar

Patched:

Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                   5209 MB/s  5265 MB/s   211957760 100.00 silesia.tar
xz 5.2.5 -0              16.4 MB/s  51.9 MB/s    62579868  29.52 silesia.tar
xz 5.2.5 -1              12.0 MB/s  57.8 MB/s    58408297  27.56 silesia.tar
xz 5.2.5 -2              8.26 MB/s  60.2 MB/s    56708167  26.75 silesia.tar
xz 5.2.5 -3              5.14 MB/s  61.6 MB/s    55745576  26.30 silesia.tar
xz 5.2.5 -4              3.29 MB/s  62.7 MB/s    52106950  24.58 silesia.tar
xz 5.2.5 -5              2.42 MB/s  64.9 MB/s    49960648  23.57 silesia.tar
xz 5.2.5 -6              2.03 MB/s  65.6 MB/s    49196155  23.21 silesia.tar
xz 5.2.5 -7              1.93 MB/s  65.6 MB/s    48926731  23.08 silesia.tar
xz 5.2.5 -8              1.94 MB/s  66.3 MB/s    48768992  23.01 silesia.tar
xz 5.2.5 -9              1.91 MB/s  66.4 MB/s    48747544  23.00 silesia.tar
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)
Larhzu commented 8 months ago

I created a branch memcmplen_arm64 which should do the same as your first commit and also adds MSVC support on ARM64 (untested).

The commit message of your second commit has significantly higher numbers on the memcpy line. I'm not familiar with lzbench but I wonder if 10 % difference in memcpy could indicate that there was something different on the test computer and thus the benchmark results could be slightly different too. The difference in compression speed is small so it would be good to be sure that it's not due to noise.

In any case, the second patch cannot be accepted. Unfortunately you have misunderstood the problem with type punning. It's about the C programming language and how modern compilers optimize while still staying within the exact requirements of the C standard. Unsafe use of type punning breaks strict aliasing rules and might result in broken executables. The instruction set being used doesn't matter; even if unaligned access wasn't supported by the hardware, type punning would be problematic with modern compilers when accessing aligned data.

The memcpy method used by tuklib_integer.h should compile to a single instruction with modern GCC and Clang/LLVM versions when building for a target that supports fast unaligned access. Thus the use of type punning shouldn't make a difference on ARM64. However, it's possible that compilers do something slightly differently still and thus there could be a difference in practice, or the violation of aliasing rules allows compilers to do something that happens to work but could cause problems some day. It's a bit annoying situation but I don't know any better way.

Thanks!

parheliamm commented 8 months ago

I created a branch memcmplen_arm64 which should do the same as your first commit and also adds MSVC support on ARM64 (untested).

The commit message of your second commit has significantly higher numbers on the memcpy line. I'm not familiar with lzbench but I wonder if 10 % difference in memcpy could indicate that there was something different on the test computer and thus the benchmark results could be slightly different too. The difference in compression speed is small so it would be good to be sure that it's not due to noise.

In any case, the second patch cannot be accepted. Unfortunately you have misunderstood the problem with type punning. It's about the C programming language and how modern compilers optimize while still staying within the exact requirements of the C standard. Unsafe use of type punning breaks strict aliasing rules and might result in broken executables. The instruction set being used doesn't matter; even if unaligned access wasn't supported by the hardware, type punning would be problematic with modern compilers when accessing aligned data.

The memcpy method used by tuklib_integer.h should compile to a single instruction with modern GCC and Clang/LLVM versions when building for a target that supports fast unaligned access. Thus the use of type punning shouldn't make a difference on ARM64. However, it's possible that compilers do something slightly differently still and thus there could be a difference in practice, or the violation of aliasing rules allows compilers to do something that happens to work but could cause problems some day. It's a bit annoying situation but I don't know any better way.

Yes, you are correct, I disassemble the code, not only memcpy but also UNSAFE_TYPE_PUNNING are all interpreted as below:

 640:   f9400000    ldr x0, [x0]
 644:   d65f03c0    ret

So the 2nd patch is useless, we can keep the 1st patch only.

Thanks!

Larhzu commented 8 months ago

The 8-byte version is now enabled in memcmplen.h for ARM64 in the master branch. It will be included in XZ Utils 5.6.0. Thanks!