namniav opened this issue 2 years ago
@llvm/issue-subscribers-backend-aarch64
So IIUC the issue is with upstream Clang and Apple Clang doesn't have the runtime difference? Could you share the assembly generated by both, plus the Apple Clang version?
assembly-by-upstream-Clang.txt assembly-by-Apple-Clang.txt
❯ /usr/bin/clang --version
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: arm64-apple-darwin21.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Output of the program generated by Apple Clang:
loop <2> checksum=4255083000 time=75ms
unroll<2> checksum=4255083000 time=48ms
loop <3> checksum=4255083000 time=55ms
unroll<3> checksum=4255083000 time=55ms
So IIUC the issue is with upstream Clang and Apple Clang doesn't have the runtime difference?
@fhahn I think the main issue here is that I am not comparing upstream Clang with Apple Clang, but comparing upstream Clang with itself.
I'm not that surprised that it makes a bad decision; the scheduling model it uses is for the Apple A7 (Cyclone) from 2013, which probably has very different characteristics. Hopefully Apple can upstream a more accurate model. We see similar cases, but in the opposite direction, in Julia: for some loops it only does a 2x unroll where an 8x unroll is almost 4x faster.
Can you try after this change? https://reviews.llvm.org/D119788 (you might have already)
Yeah, it was with that. I can also see a difference when compiling a simple C++ reduction with Apple Clang vs. upstream Clang, where Apple's does a 4x unroll but normal Clang does 2x. I suspect the issue is with https://github.com/llvm/llvm-project/blob/1534177f8f7edd83083ceda7c14d6d40cc872c6e/llvm/lib/Target/AArch64/AArch64.td#L1200-L1203, where the scheduling model used is the one for the Apple A7 (Cyclone), https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64SchedCyclone.td, which leads it to make suboptimal decisions with respect to unrolling.
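A reduction of roughly the following shape (a hypothetical placeholder, since the source used for that comparison is not attached here) can be used to compare the unroll factor each compiler picks at -O3:

```cpp
// Hypothetical reduction loop (placeholder; not the exact source mentioned above).
// Compiling it at -O3 for arm64 with Apple Clang and with upstream Clang, then
// inspecting the assembly, shows which interleave/unroll factor each one chooses.
#include <cstddef>
#include <cstdint>

std::uint32_t reduce(const std::uint8_t *data, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];   // simple sum reduction; trip count unknown at compile time
    return sum;
}
```

Compiling this with clang++ -O3 -S under each toolchain and diffing the inner loop is enough to see the chosen unroll factor.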
Output on my M1 MacBook Air (compiled with clang++ -Wall -Wextra -Werror -std=c++20 -O3 -fno-lto test.cpp):

Issue 1: Compared to cond_unroll<3>, cond_loop<3> makes conditional_sum 3x faster! I don't see any difference in the source code except that cond_unroll<3> unrolls the trivial loop in cond_loop<3>.

Issue 2: Compared to cond_loop<2>, cond_loop<3> makes conditional_sum 3x faster! The difference is that cond_loop<3> has an additional positive interval. But for positive signed char data[i], sum += data[i] is equivalent to sum += data[i] & 0xff, so why does adding a useless positive interval make conditional_sum 3x faster?

Note that Clang on my desktop PC (Ubuntu 20.04 running on an Intel CPU) doesn't have this issue, so it might be platform-specific. Apple's Clang doesn't have this issue either.
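test.cpp is not reproduced in this excerpt; the sketch below is a reconstruction from the description above of how cond_loop<N> and cond_unroll<N> are presumably shaped (the interval bounds, element types, and helper names are assumptions):

```cpp
// Reconstruction sketch, not the attached test.cpp: conditional_sum adds data[i]
// whenever it lies in one of N value intervals. cond_loop<N> checks the intervals
// with a trivial inner loop; cond_unroll<N> writes the same checks out by hand.
// The interval bounds below are placeholders.
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

template <std::size_t N>
constexpr std::array<std::pair<signed char, signed char>, N> kIntervals{};  // placeholder bounds

template <std::size_t N>
std::uint32_t cond_loop(const signed char *data, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (auto [lo, hi] : kIntervals<N>)            // trivial loop over the intervals
            if (data[i] >= lo && data[i] <= hi)
                sum += data[i];
    return sum;
}

template <std::size_t N>
std::uint32_t cond_unroll(const signed char *data, std::size_t n) {
    static_assert(N == 2 || N == 3, "sketch only spells out N = 2 and N = 3");
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // same checks as cond_loop<N>, unrolled by hand
        if (data[i] >= kIntervals<N>[0].first && data[i] <= kIntervals<N>[0].second) sum += data[i];
        if (data[i] >= kIntervals<N>[1].first && data[i] <= kIntervals<N>[1].second) sum += data[i];
        if constexpr (N > 2)
            if (data[i] >= kIntervals<N>[2].first && data[i] <= kIntervals<N>[2].second) sum += data[i];
    }
    return sum;
}
```

Between cond_loop<2> and cond_loop<3> only the number of intervals differs, which is what makes the 3x gap surprising given that the extra positive interval should be a no-op for positive data.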
Versions (-v output):