google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch
Apache License 2.0

Build error when cross-building for Aarch64 (ARM64) with GCC 12.3 (from Yocto Mickledore 4.2): hwy/base.h:1155:16: error: inlining failed in call to 'always_inline' 'size_t hwy::Num0BitsAboveMS1Bit_Nonzero32(uint32_t)': target specific option mismatch #1570

Closed: clopez closed this issue 1 year ago

clopez commented 1 year ago

I tried both the last stable release, 1.0.4 (46e365d6770f5d7a4240d8ac9d8e928a520478ea), and master as of today (7233df1b4e29a04ecfd3a10a1da14c802f08c3fd).

In both cases I get this build error:

[  5%] Building CXX object CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o
Reaping winning child 0x557a0072df70 PID 88468 
/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot-native/usr/bin/aarch64-poky-linux/aarch64-poky-linux-g++   -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot -DHWY_STATIC_DEFINE -I/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git -O2 -pipe -g -feliminate-unused-debug-types -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/build=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/build=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot=  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot=  
-fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot-native=  -fvisibility-inlines-hidden -O2 -g -DNDEBUG -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -Wno-builtin-macro-redefined -D__DATE__=\"redacted\" -D__TIMESTAMP__=\"redacted\" -D__TIME__=\"redacted\" -fmerge-all-constants -Wall -Wextra -Wconversion -Wsign-conversion -Wvla -Wnon-virtual-dtor -fmath-errno -fno-exceptions -MD -MT CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o -MF CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o.d -o CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o -c /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/contrib/sort/vqsort_128a.cc
Live child 0x557a0072df70 (CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o) PID 88470 
In file included from /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/contrib/sort/vqsort.h:30,
                 from /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/contrib/sort/vqsort_128a.cc:16:
/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/base.h: In function 'void hwy::N_NEON::detail::BaseCase(D, TraitsKV, T*, size_t, T*) [with D = hwy::N_NEON::Simd<long unsigned int, 2, 0>; TraitsKV = SharedTraits<Traits128<OrderAscending128> >; T = long unsigned int]':
/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/base.h:1155:16: error: inlining failed in call to 'always_inline' 'size_t hwy::Num0BitsAboveMS1Bit_Nonzero32(uint32_t)': target specific option mismatch
 1155 | HWY_API size_t Num0BitsAboveMS1Bit_Nonzero32(const uint32_t x) {
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/contrib/sort/vqsort_128a.cc:23,
                 from /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/foreach_target.h:16,
                 from /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/contrib/sort/vqsort_128a.cc:20:
/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/contrib/sort/vqsort-inl.h:489:41: note: called from here
  489 |       32 - Num0BitsAboveMS1Bit_Nonzero32(static_cast<uint32_t>(num_keys - 1));
      |            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reaping losing child 0x557a0072df70 PID 88470 
make[2]: *** [CMakeFiles/hwy_contrib.dir/build.make:76: CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o] Error 1
Removing child 0x557a0072df70 PID 88470 from chain.
Reaping losing child 0x561d7a43fac0 PID 88467 
make[1]: *** [CMakeFiles/Makefile2:285: CMakeFiles/hwy_contrib.dir/all] Error 2
Removing child 0x561d7a43fac0 PID 88467 from chain.
Reaping losing child 0x55d867137200 PID 88457 
make: *** [Makefile:146: all] Error 2
Removing child 0x55d867137200 PID 88457 from chain.
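For context, the helper named in the error is not target-specific in itself: it counts the zero bits above the most-significant set bit (a leading-zero count), and the expression at vqsort-inl.h:489 turns that into a ceiling log2. The error message says the failure is a target option mismatch between caller and callee, not a problem with this logic. A portable sketch of the same arithmetic, using the hypothetical names `LeadingZeros32` and `CeilLog2` (Highway's own helper presumably uses a compiler builtin rather than a loop):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portable stand-in for hwy::Num0BitsAboveMS1Bit_Nonzero32:
// returns the number of zero bits above the most-significant 1 bit.
// The real helper requires x != 0; this sketch asserts that, and the
// `mask != 0` guard keeps the loop finite (returning 32) if asserts are off.
inline size_t LeadingZeros32(uint32_t x) {
  assert(x != 0);
  size_t zeros = 0;
  for (uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
    ++zeros;
  }
  return zeros;
}

// Mirrors the expression at vqsort-inl.h:489:
//   32 - Num0BitsAboveMS1Bit_Nonzero32(static_cast<uint32_t>(num_keys - 1))
// For 2 <= num_keys <= 2^32 this equals ceil(log2(num_keys)).
inline size_t CeilLog2(size_t num_keys) {
  assert(num_keys >= 2);
  return 32 - LeadingZeros32(static_cast<uint32_t>(num_keys - 1));
}
```

For example, `CeilLog2(5)` and `CeilLog2(8)` both return 3, while `CeilLog2(9)` returns 4.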

Compiler identification:

# $CXX -v
Using built-in specs.
COLLECT_GCC=aarch64-poky-linux-g++
COLLECT_LTO_WRAPPER=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot-native/usr/bin/aarch64-poky-linux/../../libexec/aarch64-poky-linux/gcc/aarch64-poky-linux/12.3.0/lto-wrapper
Target: aarch64-poky-linux
Configured with: ../../../../../../work-shared/gcc-12.3.0-r0/gcc-12.3.0/configure --build=x86_64-linux --host=x86_64-linux --target=aarch64-poky-linux --prefix=/host-native/usr --exec_prefix=/host-native/usr --bindir=/host-native/usr/bin/aarch64-poky-linux --sbindir=/host-native/usr/bin/aarch64-poky-linux --libexecdir=/host-native/usr/libexec/aarch64-poky-linux --datadir=/host-native/usr/share --sysconfdir=/host-native/etc --sharedstatedir=/host-native/com --localstatedir=/host-native/var --libdir=/host-native/usr/lib/aarch64-poky-linux --includedir=/host-native/usr/include --oldincludedir=/host-native/usr/include --infodir=/host-native/usr/share/info --mandir=/host-native/usr/share/man --disable-silent-rules --disable-dependency-tracking --with-libtool-sysroot=/host-native --enable-clocale=generic --with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib --enable-default-pie --enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch --program-prefix=aarch64-poky-linux- --without-local-prefix --disable-install-libiberty --disable-libssp --enable-libitm --enable-lto --disable-bootstrap --with-system-zlib --with-linker-hash-style=sysv --enable-linker-build-id --with-ppl=no --with-cloog=no --enable-checking=release --enable-cheaders=c_global --without-isl --with-gxx-include-dir=/not/exist/usr/include/c++/12.3.0 --with-sysroot=/not/exist --with-build-sysroot=/host --enable-standard-branch-protection --enable-poison-system-directories=error --with-system-zlib --disable-static --disable-nls --with-glibc-version=2.28 --enable-initfini-array --enable-__cxa_atexit
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.3.0 (GCC) 

With the previous version of Yocto (4.1, aka Langdale, with GCC 12.2.0) it worked fine.

The yocto recipe is here: https://github.com/Igalia/meta-webkit/blob/main/recipes-extended/highway

clopez commented 1 year ago

Cross-building for ARMv7 (32-bit) works fine; I only hit this issue on AArch64.

jan-wassenberg commented 1 year ago

Hi, thanks for reporting the issue. This looks similar to #1460 which had previously been reported. There, the compiler was configured with a specific -march, but I don't see that in your config. Are you perhaps specifying a -march or -mcpu via CXXFLAGS? Does the workaround in #1460 help?

clopez commented 1 year ago

Yes, I have this defined in the build environment:

# env | grep march
CPP=aarch64-poky-linux-gcc -E --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot  -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security
CXX=aarch64-poky-linux-g++  -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot
CCLD=aarch64-poky-linux-gcc  -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot
FC=aarch64-poky-linux-gfortran  -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot
CC=aarch64-poky-linux-gcc  -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot

So it sets -march=armv8-a+crc.

I have tested it and the following patch fixes the build error for my use case:

diff --git a/hwy/ops/set_macros-inl.h b/hwy/ops/set_macros-inl.h
index f64a6a5..e750509 100644
--- a/hwy/ops/set_macros-inl.h
+++ b/hwy/ops/set_macros-inl.h
@@ -361,7 +361,7 @@
 // Do not define HWY_TARGET_STR (no pragma).
 #else
 #if HWY_COMPILER_GCC_ACTUAL
-#define HWY_TARGET_STR "arch=armv8-a+crypto"
+#define HWY_TARGET_STR "arch=armv8-a+crc"
 #else  // clang
 #define HWY_TARGET_STR "+crypto"
 #endif  // HWY_COMPILER_*
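Why this one-word patch helps can be illustrated with a simplified model that rests on two assumptions not verified in this thread: (1) on AArch64, GCC only inlines an `always_inline` callee if the callee's ISA extensions are a subset of the caller's, and (2) an `arch=...` target string replaces the command-line extension set instead of adding to it. Under those assumptions, the base.h helper (built with the command-line `+crc`) cannot be inlined into `N_NEON` code built under `arch=armv8-a+crypto`, but can be under `arch=armv8-a+crc`. The flag encoding and `CanInline` are hypothetical, not GCC internals:

```cpp
#include <cstdint>

// Illustrative subset of AArch64 ISA extension flags (not GCC's encoding).
enum IsaFlag : uint32_t {
  kCrc = 1u << 0,     // +crc
  kCrypto = 1u << 1,  // +crypto; in GCC it also defines __ARM_FEATURE_AES
                      // and __ARM_FEATURE_SHA2
};

// Assumed rule: a callee can be inlined only if every extension it was
// compiled with is also enabled in the caller.
constexpr bool CanInline(uint32_t caller_flags, uint32_t callee_flags) {
  return (caller_flags & callee_flags) == callee_flags;
}

// base.h helpers such as Num0BitsAboveMS1Bit_Nonzero32 are presumably compiled
// with the command-line flags: -mcpu=cortex-a72 -march=armv8-a+crc.
constexpr uint32_t kCommandLine = kCrc;

// Assumed: an "arch=..." target string replaces the extension set rather than
// extending it, so the +crc from the command line is not kept.
constexpr uint32_t kPragmaOriginal = kCrypto;  // "arch=armv8-a+crypto"
constexpr uint32_t kPragmaPatched = kCrc;      // "arch=armv8-a+crc"

static_assert(!CanInline(kPragmaOriginal, kCommandLine),
              "callee has +crc, caller does not: option mismatch");
static_assert(CanInline(kPragmaPatched, kCommandLine),
              "with the patch, the caller's flags include the callee's");
```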

clopez commented 1 year ago

But I don't understand why the build ends up in that branch.

With arch=armv8-a+crc, AES is not enabled (the RPi4 does not have the cryptographic extensions).

The only feature macro the compiler adds with arch=armv8-a+crc is __ARM_FEATURE_CRC32; it does not set the crypto-related ones.

Check:

# echo | aarch64-poky-linux-g++ -march=armv8-a -E - -dM > armv8-a_baseline
# echo | aarch64-poky-linux-g++ -march=armv8-a+crc -E - -dM > armv8-a_crc
# echo | aarch64-poky-linux-g++ -march=armv8-a+crypto -E - -dM > armv8-a_crypto
# diff -u armv8-a_baseline armv8-a_crc
--- armv8-a_baseline    2023-07-17 20:10:28.421690965 +0000
+++ armv8-a_crc 2023-07-17 20:10:35.737481401 +0000
@@ -299,6 +299,7 @@
 #define __INT_LEAST32_TYPE__ int
 #define __SIZEOF_WCHAR_T__ 4
 #define __UINT64_TYPE__ long unsigned int
+#define __ARM_FEATURE_CRC32 1
 #define __ARM_NEON 1
 #define __FLT128_HAS_QUIET_NAN__ 1
 #define __INTMAX_MAX__ 0x7fffffffffffffffL
# diff -u armv8-a_baseline armv8-a_crypto
--- armv8-a_baseline    2023-07-17 20:10:28.421690965 +0000
+++ armv8-a_crypto      2023-07-17 20:10:42.641283591 +0000
@@ -34,6 +34,7 @@
 #define __UINT_FAST8_MAX__ 0xff
 #define __FLT32_MAX_10_EXP__ 38
 #define __INT8_C(c) c
+#define __ARM_FEATURE_AES 1
 #define __INT_LEAST8_WIDTH__ 8
 #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL
 #define __SHRT_MAX__ 0x7fff
@@ -108,6 +109,7 @@
 #define __SIZEOF_LONG_DOUBLE__ 16
 #define __FLT64_MAX_10_EXP__ 308
 #define __FLT16_MAX_10_EXP__ 4
+#define __ARM_FEATURE_CRYPTO 1
 #define __INT_FAST32_MAX__ 0x7fffffffffffffffL
 #define __DBL_HAS_INFINITY__ 1
 #define __INT64_MAX__ 0x7fffffffffffffffL
@@ -198,6 +200,7 @@
 #define __ELF__ 1
 #define __GCC_ASM_FLAG_OUTPUTS__ 1
 #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1
+#define __ARM_FEATURE_SHA2 1
 #define __FLT_RADIX__ 2
 #define __INT_LEAST16_TYPE__ short int
 #define __ARM_ARCH_PROFILE 65

And it also has NEON, of course:

# grep -i neon armv8-a_crc 
#define __ARM_NEON 1
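These are ordinary predefined macros, so the same check can be made from C++ source. A sketch (the `ArmFeatures` struct and the `CompiledArmFeatures` and `PrintArmFeatures` functions are hypothetical names) that records which of the macros in the diff above are defined for the flags a file is built with; on a non-Arm host every field is simply false:

```cpp
#include <cstdio>

// Requires C++14 (relaxed constexpr). Records which Arm feature macros from
// the diff above are predefined for the flags this file is compiled with.
struct ArmFeatures {
  bool neon;    // __ARM_NEON
  bool crc32;   // __ARM_FEATURE_CRC32, enabled by +crc
  bool aes;     // __ARM_FEATURE_AES, enabled by +crypto
  bool sha2;    // __ARM_FEATURE_SHA2, enabled by +crypto
  bool crypto;  // __ARM_FEATURE_CRYPTO, enabled by +crypto
};

constexpr ArmFeatures CompiledArmFeatures() {
  ArmFeatures f{false, false, false, false, false};
#ifdef __ARM_NEON
  f.neon = true;
#endif
#ifdef __ARM_FEATURE_CRC32
  f.crc32 = true;
#endif
#ifdef __ARM_FEATURE_AES
  f.aes = true;
#endif
#ifdef __ARM_FEATURE_SHA2
  f.sha2 = true;
#endif
#ifdef __ARM_FEATURE_CRYPTO
  f.crypto = true;
#endif
  return f;
}

// Prints one line of the form "neon=<0|1> crc32=<0|1> aes=<0|1> ...".
inline void PrintArmFeatures(const ArmFeatures& f) {
  std::printf("neon=%d crc32=%d aes=%d sha2=%d crypto=%d\n", f.neon, f.crc32,
              f.aes, f.sha2, f.crypto);
}
```

Built with -march=armv8-a+crc, the diff above implies that neon and crc32 would be set and the three crypto-related fields would not.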

So this CPU is ARMv8-A without the crypto extensions (only CRC). Yet the build ends up here:

                #if HWY_TARGET == HWY_NEON_WITHOUT_AES
                // Do not define HWY_TARGET_STR (no pragma).
here ---->      #else
                #if HWY_COMPILER_GCC_ACTUAL
                #define HWY_TARGET_STR "arch=armv8-a+crypto"
                #else  // clang
                #define HWY_TARGET_STR "+crypto"
                #endif  // HWY_COMPILER_*
                #endif  // HWY_TARGET == HWY_NEON_WITHOUT_AES

https://github.com/google/highway/blob/9e390d70ad6802cc54f01a3e82ae91f44cd9f006/hwy/ops/set_macros-inl.h#L362

So the condition HWY_TARGET == HWY_NEON_WITHOUT_AES evaluates to false.

Why is that?

clopez commented 1 year ago

As far as I can see, HWY_TARGET gets the value HWY_STATIC_TARGET, and it is not a matter of passing -march=armv8-a+crypto or -march=armv8-a+crc: the same happens in both cases.

Running the compile command with -E -dD (so GCC also outputs the macro definitions along with the preprocessed source), I see this:

 # aarch64-poky-linux-g++   -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot -DHWY_STATIC_DEFINE -I/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git -O2 -pipe -g -feliminate-unused-debug-types -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/build=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/build=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot=  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot=  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot-native=  -fvisibility-inlines-hidden -O2 -g -DNDEBUG -fPIC -fvisibility=hidden 
-fvisibility-inlines-hidden -Wno-builtin-macro-redefined -D__DATE__=\"redacted\" -D__TIMESTAMP__=\"redacted\" -D__TIME__=\"redacted\" -fmerge-all-constants -Wall -Wextra -Wconversion -Wsign-conversion -Wvla -Wnon-virtual-dtor -fmath-errno -fno-exceptions -MD -MT CMakeFiles/hwy.dir/hwy/nanobenchmark.cc.o -MF CMakeFiles/hwy.dir/hwy/nanobenchmark.cc.o.d -E -dD /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/ops/set_macros-inl.h|grep HWY_TARGET
#define HWY_TARGET HWY_STATIC_TARGET
#define HWY_TARGETS (HWY_ATTAINABLE_TARGETS & ((HWY_STATIC_TARGET - 1LL) | HWY_STATIC_TARGET))
#undef HWY_TARGET_STR
#define HWY_TARGET_STR_PCLMUL_AES ",pclmul,aes"
#define HWY_TARGET_STR_BMI2_FMA ",bmi,bmi2,fma"
#define HWY_TARGET_STR_F16C ",f16c"
#define HWY_TARGET_STR_SSE2 "sse2"
#define HWY_TARGET_STR_SSSE3 "sse2,ssse3"
#define HWY_TARGET_STR_SSE4 HWY_TARGET_STR_SSSE3 ",sse4.1,sse4.2" HWY_TARGET_STR_PCLMUL_AES
#define HWY_TARGET_STR_AVX2 HWY_TARGET_STR_SSE4 ",avx,avx2" HWY_TARGET_STR_BMI2_FMA HWY_TARGET_STR_F16C
#define HWY_TARGET_STR_AVX3 HWY_TARGET_STR_AVX2 ",avx512f,avx512cd,avx512vl,avx512dq,avx512bw"
#define HWY_TARGET_STR_AVX3_DL HWY_TARGET_STR_AVX3 ",vpclmulqdq,avx512vbmi,avx512vbmi2,vaes,avx512vnni,avx512bitalg," "avx512vpopcntdq,gfni"
#define HWY_TARGET_STR_AVX3_SPR HWY_TARGET_STR_AVX3_DL ",avx512fp16"
#define HWY_TARGET_STR_PPC8_CRYPTO ",crypto"
#define HWY_TARGET_STR_PPC8 "altivec,vsx,power8-vector" HWY_TARGET_STR_PPC8_CRYPTO
#define HWY_TARGET_STR_PPC9 HWY_TARGET_STR_PPC8 ",power9-vector"
#define HWY_TARGET_STR_PPC10 HWY_TARGET_STR_PPC9 ",cpu=power10"
clopez commented 1 year ago

Hmm, it is more complex than that. Preprocessing a .cc file that includes hwy/foreach_target.h, such as hwy/contrib/sort/vqsort_128a.cc, shows HWY_TARGET being assigned several different values, ending with HWY_STATIC_TARGET:

# aarch64-poky-linux-g++   -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot -DHWY_STATIC_DEFINE -I/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git -O2 -pipe -g -feliminate-unused-debug-types -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/build=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/build=/usr/src/debug/highway/1.0.4.99.git20230717-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot=  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot=  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/recipe-sysroot-native=  -fvisibility-inlines-hidden -O2 -g -DNDEBUG -fPIC -fvisibility=hidden 
-fvisibility-inlines-hidden -Wno-builtin-macro-redefined -D__DATE__=\"redacted\" -D__TIMESTAMP__=\"redacted\" -D__TIME__=\"redacted\" -fmerge-all-constants -Wall -Wextra -Wconversion -Wsign-conversion -Wvla -Wnon-virtual-dtor -fmath-errno -fno-exceptions -MD -MT CMakeFiles/hwy.dir/hwy/nanobenchmark.cc.o -MF CMakeFiles/hwy.dir/hwy/nanobenchmark.cc.o.d -E -dD /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/highway/1.0.4.99.git20230717-r0/git/hwy/contrib/sort/vqsort_128a.cc|grep "define HWY_TARGET "
#define HWY_TARGET HWY_STATIC_TARGET
#define HWY_TARGET HWY_NEON
#define HWY_TARGET HWY_SVE
#define HWY_TARGET HWY_SVE2
#define HWY_TARGET HWY_SVE_256
#define HWY_TARGET HWY_SVE2_128
#define HWY_TARGET HWY_STATIC_TARGET

You have a really complex system for setting these compiler directives, and it is not easy to debug. I don't understand what is going on.
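The list above points at an answer to the earlier "Why is that?": the file is compiled once per target, one of those passes is HWY_NEON, and for that pass HWY_TARGET == HWY_NEON_WITHOUT_AES is false, so it gets the +crypto target string regardless of the command-line -march. A rough compile-time model of the GCC branch of the set_macros-inl.h snippet follows; the names are hypothetical, the SVE passes are ignored, and it assumes the static target for these flags is NEON_WITHOUT_AES:

```cpp
// Only the two NEON cases from the set_macros-inl.h snippet are modelled.
enum class Target { kNeonWithoutAes, kNeon };

// Mirrors the GCC branch of the snippet: nullptr means "do not define
// HWY_TARGET_STR (no pragma)"; otherwise the string the pragma would use.
constexpr const char* GccTargetStr(Target t) {
  return t == Target::kNeonWithoutAes ? nullptr : "arch=armv8-a+crypto";
}

// Assumed NEON-related passes for a file that includes foreach_target.h,
// built with -march=armv8-a+crc: the static target (assumed here to be
// NEON_WITHOUT_AES) and HWY_NEON, as in the -E -dD output above.
constexpr Target kPasses[] = {Target::kNeonWithoutAes, Target::kNeon};
constexpr int kNumPasses = 2;

// Counts passes compiled under a target pragma (C++11-style recursion).
constexpr int PassesWithPragma(const Target* passes, int n) {
  return n == 0 ? 0
               : (GccTargetStr(passes[0]) != nullptr ? 1 : 0) +
                     PassesWithPragma(passes + 1, n - 1);
}

// Compile-time string equality, used to check the selected string.
constexpr bool StrEq(const char* a, const char* b) {
  return *a == *b && (*a == '\0' || StrEq(a + 1, b + 1));
}

static_assert(PassesWithPragma(kPasses, kNumPasses) == 1,
              "the HWY_NEON pass gets a pragma despite -march=armv8-a+crc");
static_assert(StrEq(GccTargetStr(Target::kNeon), "arch=armv8-a+crypto"),
              "and that pragma selects armv8-a+crypto");
```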

clopez commented 1 year ago

OK, forget what I said above.

Does the workaround in #1460 help?

I have re-read the comments there more carefully: the workaround is not to change the value of HWY_TARGET_STR but to enable static dispatch.

That works in theory, and it looks like the right solution for Yocto, because a Yocto build targets one very specific machine rather than a set of CPUs.

But in practice I then run into problems when building libjxl (that is why I'm using Highway at all: I only need it for libjxl).

It seems libjxl calls HWY_DYNAMIC_DISPATCH directly in several places in the code.

So if I enable static dispatch by adding -DHWY_COMPILE_ONLY_STATIC to CXXFLAGS/CFLAGS, I later get this build error in libjxl:

| FAILED: lib/CMakeFiles/jxl_dec-obj.dir/jxl/modular/transform/squeeze.cc.o
| /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/recipe-sysroot-native/usr/bin/aarch64-poky-linux/aarch64-poky-linux-g++ -DHWY_DISABLED_TARGETS="(HWY_SVE|HWY_SVE2|HWY_SVE_256|HWY_SVE2_128|HWY_RVV)" -DJPEGXL_MAJOR_VERSION=0 -DJPEGXL_MINOR_VERSION=8 -DJPEGXL_PATCH_VERSION=1 -DJXL_INTERNAL_LIBRARY_BUILD -D__DATE__=\"redacted\" -D__TIMESTAMP__=\"redacted\" -D__TIME__=\"redacted\" -I/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git -I/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git/lib/include -I/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/build/lib/include -mcpu=cortex-a72 -march=armv8-a+crc -fstack-protector-strong  -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security  --sysroot=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/recipe-sysroot  -O2 -pipe -g -feliminate-unused-debug-types -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git=/usr/src/debug/libjxl/0.8.1-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git=/usr/src/debug/libjxl/0.8.1-r0  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/build=/usr/src/debug/libjxl/0.8.1-r0  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/build=/usr/src/debug/libjxl/0.8.1-r0  
-fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/recipe-sysroot=  -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/recipe-sysroot=  -fdebug-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/recipe-sysroot-native=  -fvisibility-inlines-hidden -fno-rtti -funwind-tables -fno-omit-frame-pointer -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -fmacro-prefix-map=/home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git=. -Wno-builtin-macro-redefined -Wall -fmerge-all-constants -fno-builtin-fwrite -fno-builtin-fread -Wextra -Wc++11-compat -Warray-bounds -Wformat-security -Wimplicit-fallthrough -Wno-register -Wno-unused-function -Wno-unused-parameter -Wnon-virtual-dtor -Woverloaded-virtual -Wvla -fsized-deallocation -fno-exceptions -fmath-errno -DJPEGXL_ENABLE_TRANSCODE_JPEG=0 -DJPEGXL_ENABLE_BOXES=1 -std=c++11 -MD -MT lib/CMakeFiles/jxl_dec-obj.dir/jxl/modular/transform/squeeze.cc.o -MF lib/CMakeFiles/jxl_dec-obj.dir/jxl/modular/transform/squeeze.cc.o.d -o lib/CMakeFiles/jxl_dec-obj.dir/jxl/modular/transform/squeeze.cc.o -c /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git/lib/jxl/modular/transform/squeeze.cc
| In file included from /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git/lib/jxl/modular/transform/squeeze.h:30,
|                  from /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git/lib/jxl/modular/transform/squeeze.cc:6:
| /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git/lib/jxl/modular/modular_image.h: In lambda function:
| /home/clopez/webkit/webkit/WebKitBuild/CrossToolChains/rpi4-64bits-mesa/build/tmp/work/cortexa72-poky-linux/libjxl/0.8.1-r0/git/lib/jxl/modular/modular_image.h:70:26: error: inlining failed in call to 'always_inline' 'jxl::pixel_type* jxl::Channel::Row(size_t)': target specific option mismatch
|    70 |   JXL_INLINE pixel_type* Row(const size_t y) { return plane.Row(y); }
|       |                          ^~~
jan-wassenberg commented 1 year ago

Hi @clopez, thanks for looking into this further. Yes, HWY_TARGET is set multiple times: JPEG XL compiles the code once per target instruction set, and the binary then contains code for all of them. This 'dynamic dispatch' model is in contrast to your "I always want to build with just +crc" (static dispatch).

It is actually OK to still use HWY_DYNAMIC_DISPATCH; it just involves an extra function call (no problem). Static dispatch really just means limiting the set of HWY_TARGET values to a single option.

To get that, it is important to specify -DHWY_COMPILE_ONLY_STATIC when compiling both Highway and JPEG XL. Is it possible that it is only being set for Highway?
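The distinction described above can be sketched as a small dispatcher: with dynamic dispatch several per-target variants are compiled and one is picked at run time from the CPU's capabilities; with static-only compilation a single variant exists, and calling through the dispatcher still works at the cost of an indirect call. All names here are hypothetical, `cpu_has_aes` stands in for runtime CPU detection, and treating NEON_WITHOUT_AES as the static target for these flags is an assumption; this is not Highway's implementation:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-target variants of one function. In Highway these would be
// produced by compiling the same source once per HWY_TARGET.
inline const char* SortNeonWithoutAes() { return "NEON_WITHOUT_AES"; }
inline const char* SortNeon() { return "NEON"; }  // would need the +crypto pragma

using SortFn = const char* (*)();

struct Variant {
  SortFn fn;
  bool needs_aes;
};

// Returns the variant a dispatcher would call. `compile_only_static` models
// building with only the static target; `cpu_has_aes` stands in for runtime
// CPU detection.
inline SortFn Choose(bool compile_only_static, bool cpu_has_aes) {
  // Best first. With static-only compilation the NEON (+crypto) variant is
  // never compiled, so only the baseline entry is considered.
  static const Variant kVariants[] = {{SortNeon, true},
                                      {SortNeonWithoutAes, false}};
  const size_t first = compile_only_static ? 1 : 0;
  for (size_t i = first; i < sizeof(kVariants) / sizeof(kVariants[0]); ++i) {
    if (!kVariants[i].needs_aes || cpu_has_aes) return kVariants[i].fn;
  }
  return nullptr;  // not reached: the baseline variant is always usable
}

// Calls through the dispatcher, loosely like HWY_DYNAMIC_DISPATCH-style code.
inline const char* CallSort(bool compile_only_static, bool cpu_has_aes) {
  SortFn fn = Choose(compile_only_static, cpu_has_aes);
  assert(fn != nullptr);
  return fn();
}
```

In this model, the static-only case never compiles the NEON variant that would need the +crypto pragma, while the call site stays the same.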

clopez commented 1 year ago

To get that, it is important to specify the -DHWY_COMPILE_ONLY_STATIC both when compiling Highway as well as JPEG XL. Is it possible that it is only being set for Highway?

On one hand I'm building Highway as a shared library; on the other, I'm building libjxl with -DJPEGXL_FORCE_SYSTEM_HWY=ON, so that it links against the Highway shared library I built previously.

So yes, I'm not passing -DHWY_COMPILE_ONLY_STATIC to the libjxl build. However, I have now grepped the whole libjxl source tree for HWY_COMPILE_ONLY_STATIC and don't see it defined anywhere. I don't see how defining it for the libjxl build would make any difference, given that I'm not statically linking the Highway copy bundled with libjxl but dynamically linking against a previously built Highway shared library.

jan-wassenberg commented 1 year ago

Got it. FYI the Highway shared library has very little in it. Most happens in headers, and JPEG XL includes those Highway headers. Adding -DHWY_COMPILE_ONLY_STATIC to the JPEG XL build changes their behavior in the desired way :)
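A minimal illustration of why the define has to reach the JPEG XL build as well: code in a header is compiled inside every translation unit that includes it, so it sees that unit's macros, not the flags the shared library was built with. `DEMO_COMPILE_ONLY_STATIC` and namespace `demo` are hypothetical stand-ins for HWY_COMPILE_ONLY_STATIC and Highway's headers:

```cpp
// Consumer translation unit (e.g. a JPEG XL source file). In a real build the
// define would come from the command line, e.g. -DDEMO_COMPILE_ONLY_STATIC.
#define DEMO_COMPILE_ONLY_STATIC 1

// ---- Everything below stands in for a library header. ----
namespace demo {
#ifdef DEMO_COMPILE_ONLY_STATIC
constexpr int kCompiledTargets = 1;  // only the static (baseline) target
#else
constexpr int kCompiledTargets = 2;  // baseline plus an extra target
#endif

// constexpr functions are implicitly inline: this code is compiled into the
// consumer's object file, so how the shared library itself was built does
// not affect it.
constexpr int CompiledTargets() { return kCompiledTargets; }
}  // namespace demo
```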

clopez commented 1 year ago

I see. Thanks for the info and the documentation.

I found another solution to this issue that allows building with dynamic dispatch enabled. See: https://github.com/google/highway/pull/1589