I wrote a NEON-optimized version of a function that computes the GNU hash value for a symbol name, and Clang's version of the function is slower than what GCC generates (or what I can do with hand-written assembly).
I'm not quite sure what LLVM is doing that makes it slower. I did notice that my hand-written assembly doesn't create a stack frame, whereas both the GCC and Clang versions do.
Details:
I'm working on making the Bionic dynamic linker's GNU hash calculation faster, because it takes a significant portion of the total linker run-time. (At one point, I measured it taking 20% of the total run-time doing the initial linking of cameraserver.)
The linker currently uses a simple function to calculate the hash.
uint32_t SymbolName::gnu_hash() {
  if (!has_gnu_hash_) {
    uint32_t h = 5381;
    const uint8_t* name = reinterpret_cast<const uint8_t*>(name_);
    while (*name != 0) {
      h += (h << 5) + *name++;  // h*33 + c = h + h*32 + c = h + (h << 5) + c
    }
    gnu_hash_ = h;
    has_gnu_hash_ = true;
  }
  return gnu_hash_;
}
Using hand-written arm32 Neon assembly, I wrote a function that takes 30-50% less time than the simple C++ version. Using C++ code with Neon intrinsics instead, I can write something that's still faster than the simple C++ version, but has about half the improvement when I compile with Clang. GCC, on the other hand, gets much closer to my hand-written assembly.
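All of the vectorized versions rely on the fact that the h = h*33 + c recurrence can be applied a whole block at a time: after consuming eight bytes c0..c7 in one step, h' = h*33^8 + c0*33^7 + c1*33^6 + ... + c7 (mod 2^32). The following is a hedged, portable scalar sketch of that block update, just to check the algebra (the function names are illustrative, not the code in the tarball); the NEON versions compute the eight byte-times-constant products in parallel:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reference scalar hash: DJB2, as in the linker's gnu_hash().
static uint32_t gnu_hash_scalar(const char* s) {
  uint32_t h = 5381;
  for (const uint8_t* p = reinterpret_cast<const uint8_t*>(s); *p != 0; ++p) {
    h += (h << 5) + *p;  // h = h*33 + c
  }
  return h;
}

// Blocked variant: fold 8 bytes per step using precomputed powers of 33.
//   h' = h*33^8 + c0*33^7 + c1*33^6 + ... + c7   (mod 2^32)
// A NEON implementation can compute the eight products in parallel; this
// portable version only demonstrates that the block update is equivalent.
static uint32_t gnu_hash_blocked(const char* s) {
  // pow33[i] == 33^i mod 2^32 (unsigned arithmetic wraps as needed).
  uint32_t pow33[9];
  pow33[0] = 1;
  for (int i = 1; i <= 8; ++i) pow33[i] = pow33[i - 1] * 33u;

  const uint8_t* p = reinterpret_cast<const uint8_t*>(s);
  size_t len = std::strlen(s);
  uint32_t h = 5381;
  while (len >= 8) {
    uint32_t block = 0;
    for (int i = 0; i < 8; ++i) block += p[i] * pow33[7 - i];
    h = h * pow33[8] + block;
    p += 8;
    len -= 8;
  }
  for (; *p != 0; ++p) h += (h << 5) + *p;  // scalar tail (< 8 bytes)
  return h;
}
```

Since everything is mod 2^32, the per-lane multiplies and the final horizontal sum are exact in 32-bit unsigned arithmetic, which is what makes the SIMD formulation legal.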
Here are some numbers from an arm32-only Android Go phone, using the "performance" scaling governor. I used the multitime utility (https://tratt.net/laurie/src/multitime) to run the benchmarks repeatedly and compute confidence intervals.
Clang, simple C function: 0.441+/-0.0001 (in seconds of wall clock time)
GCC, simple C function: 0.376+/-0.0001
Clang, using Neon intrinsics: 0.373+/-0.0001 (Clang ignored pragma unroll)
GCC, using Neon intrinsics: 0.330+/-0.0001 (w/ no pragma GCC unroll)
GCC, using Neon intrinsics: 0.312+/-0.0003 (w/ pragma GCC unroll 8)
Handwritten assembly: 0.311+/-0.0001
I also looked at a walleye Pixel 2 device (core 4, one of the fast ones). For arm32:
Clang, simple C function: 0.347+/-0.0023
GCC, simple C function: 0.323+/-0.0021
Clang, using Neon intrinsics: 0.225+/-0.0013
GCC, using Neon intrinsics: 0.208+/-0.0013 (w/ no pragma GCC unroll)
GCC, using Neon intrinsics: 0.186+/-0.0007 (w/ pragma GCC unroll 8)
Handwritten assembly: 0.176+/-0.0013
I don't have handwritten assembly for arm64, but I benchmarked the C++ code.
Clang, simple C function: 0.308+/-0.0017
GCC, simple C function: 0.285+/-0.0018
Clang, using Neon intrinsics: 0.205+/-0.0016 (Clang ignored pragma unroll)
GCC, using Neon intrinsics: 0.189+/-0.0010 (w/ no pragma GCC unroll)
GCC, using Neon intrinsics: 0.217+/-0.0015 (w/ pragma GCC unroll 4)
GCC, using Neon intrinsics: 0.214+/-0.0004 (w/ pragma GCC unroll 8)
I attached a tarball with the source code, Makefile, and a couple of scripts for running the benchmarks via adb.
I also uploaded three assembly files:
my hand-crafted arm32 assembly
the output from NDK r21 beta 1's compiler (Clang as of r365631)
the output from arm-linux-gnueabi-gcc-8 8.3.0 from my gLinux machine