'for' loops on arrays of SIMD variables are compiled to inefficient code

llvmbot commented 7 years ago


Bugzilla Link	34945
Version	5.0
OS	Linux
Attachments	Testcase, Testcase
Reporter	LLVM Bugzilla Contributor
CC	@echristo,@kbeyls,@pirama-arumuga-nainar,@sanjoy,@stephenhines,@kongy

Extended Description

At least with the LLVM 5.0 toolchain in Android NDK r15c (in fact with each recent NDK LLVM I've tried), when compiling to Aarch64, C++ NEON intrinsics code that uses arrays of NEON variables, like

#include <arm_neon.h>
int32x4_t foo[4];
// This for loop is unrolled by the compiler.
// Manually unrolling it does not make a difference.
for (int i = 0; i < 4; i++) do_something(foo[i]);

is slow; rewriting this code to declare separate variables instead of an array makes it much faster, e.g.

#include <arm_neon.h>
int32x4_t foo0, foo1, foo2, foo3;
// Now we have no choice but to manually unroll this code,
// as we don't have our 4 variables nicely tucked into an array.
do_something(foo0);
do_something(foo1);
do_something(foo2);
do_something(foo3);

I learned that trick from Jan Wassenberg (CC'd). It seems very surprising that this would make any difference at all.

Attaching a self-contained testcase. It's not a minimal testcase, but it allows to quantify the impact of this bug on concrete production code (https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc), and it should be trivial to extract a minimal testcase looking like the above snippets from it, or write one from scratch.

Example compilation command line:

aarch64-linux-android-clang++ -fPIE -static --std=c++11 -O3 simd-testcase.cc -o /tmp/x

Example outputs:

Pixel2 big cores, ARM Cortex-A73:

gemm_kernel_intrinsics_naive_using_arrays_of_neon_variables      14 Gop/s
gemm_kernel_intrinsics_fast_using_separate_neon_variables        21.8 Gop/s
gemm_kernel_inline_asm                                           26.8 Gop/s

Pixel2 little cores, ARM Cortex-A53:

gemm_kernel_intrinsics_naive_using_arrays_of_neon_variables      5.27 Gop/s
gemm_kernel_intrinsics_fast_using_separate_neon_variables        10.3 Gop/s
gemm_kernel_inline_asm                                           11.6 Gop/s

llvmbot commented 6 years ago

preprocessed source

llvmbot commented 6 years ago

It'd be helpful to have a pre-processed (-E) file attached, as not every developer installs an aarch64 c++ toolchain (standard library, libc, etc).

llvmbot commented 7 years ago

https://en.wikipedia.org/wiki/Black_hole_information_paradox#Postulated_solutions

Information is irretrievably lost

malaterre commented 1 year ago

I am pretty sure this one can be closed in current clang.

@jan-wassenberg

jan-wassenberg commented 1 year ago

Nice :) FYI we are still required to avoid arrays of vectors for SVE and RVV because their vectors are sizeless.

llvm / llvm-project

'for' loops on arrays of SIMD variables are compiled to inefficient code #34293

Extended Description