randombit / botan

Cryptography Toolkit
https://botan.randombit.net
BSD 2-Clause "Simplified" License

MacOS: botan built with Xcode-13 fails SHA-3 tests #2802

Closed by mouse07410 2 years ago

mouse07410 commented 2 years ago

Apple released Xcode-13. When Botan is built with it, the tests fail. This happens on both master and release-2, in the same way, on all my Macs.

Branch release-2

Configuration

./configure.py --prefix=/opt/local --with-os-features=security_framework,apple_keychain,commoncrypto,getentropy --with-openmp --with-commoncrypto --with-openssl --with-boost --with-lzma --with-bzip2 --with-zlib --with-sqlite3 --with-python-version=2.7 --with-sphinx --with-pdf --cc-abi-flags='-march=native -O3 -I/opt/local/include' 2>&1 | tee conf-out.txt

Configuration output:

$ cat conf-out.txt 
   INFO: ./configure.py invoked with options "--prefix=/opt/local --with-os-features=security_framework,apple_keychain,commoncrypto,getentropy --with-openmp --with-commoncrypto --with-openssl --with-boost --with-lzma --with-bzip2 --with-zlib --with-sqlite3 --with-python-version=2.7 --with-sphinx --with-pdf --cc-abi-flags=-march=native -O3 -I/opt/local/include"
   INFO: Configuring to build Botan 2.18.1 (revision git:b420a4545b0f9219a88c209e4c8c2474d519dfac)
   INFO: Running under 3.9.7 (default, Sep  1 2021, 12:35:15) [Clang 12.0.5 (clang-1205.0.22.9)]
   INFO: Implicit --cc-bin=clang++ due to environment variable CXX
   INFO: Implicit --cxxflags=-std=gnu++17 -O3 -march=native -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk due to environment variable CXXFLAGS
   INFO: Autodetected platform information: OS="Darwin" machine="x86_64" proc="i386"
   INFO: Guessing target OS is darwin (use --os to set)
   INFO: Guessing target processor is a x86_64 (use --cpu to set)
   INFO: Using /etc/ssl/cert.pem as system certificate store
   INFO: Auto-detected compiler version 4.0
   INFO: Auto-detected compiler arch x86_64
   INFO: Target is clang:4.0-macos-x86_64
   INFO: Assuming target x86_64 is little endian
   INFO: Skipping (incompatible CPU): aes_armv8 aes_power8 sha1_armv8 sha2_32_armv8 sm4_armv8
   INFO: Skipping (incompatible OS): certstor_system_windows proc_walk win32_stats
   INFO: Skipping (requires external dependency): tpm
   INFO: Enabling use of external dependency boost
   INFO: Enabling use of external dependency bzip2
   INFO: Enabling use of external dependency commoncrypto
   INFO: Enabling use of external dependency lzma
   INFO: Enabling use of external dependency openssl
   INFO: Enabling use of external dependency sqlite3
   INFO: Enabling use of external dependency zlib
   INFO: Loading modules: adler32 aead aes aes_ni aes_vperm aont argon2 aria asio asn1 auto_rng base base32 base58 base64 bcrypt bcrypt_pbkdf bigint blake2 block blowfish boost bzip2 camellia cascade cast128 cast256 cbc cbc_mac ccm cecpq1 certstor_flatfile certstor_sql certstor_sqlite3 certstor_system certstor_system_macos cfb chacha chacha20poly1305 chacha_avx2 chacha_rng chacha_simd32 checksum cmac comb4p commoncrypto compression cpuid crc24 crc32 cryptobox ctr curve25519 des dev_random dh dl_algo dl_group dlies dsa dyn_load eax ec_group ecc_key ecdh ecdsa ecgdsa ecies eckcdsa ed25519 elgamal eme_oaep eme_pkcs1 eme_raw emsa1 emsa_pkcs1 emsa_pssr emsa_raw emsa_x931 entropy fd_unix ffi filters fpe_fe1 gcm getentropy ghash ghash_cpu ghash_vperm gmac gost_28147 gost_3410 gost_3411 hash hash_id hex hkdf hmac hmac_drbg hotp http_util idea idea_sse2 iso9796 kasumi kdf kdf1 kdf1_iso18033 kdf2 keccak keypair lion locking_allocator lzma mac mce mceies md4 md5 mdx_hash mem_pool mgf1 misty1 mode_pad modes mp newhope nist_keywrap noekeon noekeon_simd numbertheory ocb ofb openssl par_hash passhash9 pbes2 pbkdf pbkdf1 pbkdf2 pem pgp_s2k pk_pad pkcs11 poly1305 poly_dbl prf_tls prf_x942 processor_rng psk_db pubkey rc4 rdrand_rng rdseed rfc3394 rfc6979 rmd160 rng roughtime rsa salsa20 scrypt seed serpent serpent_avx2 serpent_simd sessions_sql sessions_sqlite3 sha1 sha1_sse2 sha1_x86 sha2_32 sha2_32_bmi2 sha2_32_x86 sha2_64 sha2_64_bmi2 sha3 sha3_bmi2 shacal2 shacal2_avx2 shacal2_simd shacal2_x86 shake shake_cipher simd simd_avx2 siphash siv skein sm2 sm3 sm4 socket sodium sp800_108 sp800_56a sp800_56c sqlite3 srp6 stateful_rng stream streebog system_rng thread_utils threefish_512 threefish_512_avx2 tiger tls tls_10 tls_cbc tss twofish utils uuid whirlpool x509 x919_mac xmss xtea xts zlib
   INFO: Using symlink to link files into build dir (use --link-method to change)
   INFO: Botan 2.18.1 (revision git:b420a4545b0f9219a88c209e4c8c2474d519dfac) (unreleased undated) build setup is complete

Build output: make-out.txt.gz

Tests output

.  .  .  .  .
1416a6f128a2567fdf10079d74d2f64aaa8e2834216c698118f69109580b0f61c6fc53fdd578276e4f6b1e8fb1e5cd04a2450620c1dca97c517dc81ecfbd3776fbb75b2f211ddef474304929e0a2ef57121ba873a145e7cec15d3af0605f6e9cbc84ff70e4072f9e694557c302e2c2bb3db14bd52707b47890731e0cf6181d297d012967c3fd561f905b8a4ba23487]
xmss_verify_invalid:
XMSS/SHA2_10_256 verify invalid signature ran 28 tests in 28.65 msec all ok
XMSS/SHA2_10_512 verify invalid signature ran 28 tests in 74.90 msec all ok
XMSS/SHA2_16_256 verify invalid signature ran 28 tests in 37.11 msec all ok
XMSS/SHA2_16_512 verify invalid signature ran 28 tests in 94.26 msec all ok
XMSS/SHA2_20_256 verify invalid signature ran 28 tests in 41.33 msec all ok
XMSS/SHA2_20_512 verify invalid signature ran 28 tests in 88.67 msec all ok
XMSS/SHAKE_10_256 verify invalid signature ran 28 tests in 25.33 msec all ok
XMSS/SHAKE_10_512 verify invalid signature ran 28 tests in 95.61 msec all ok
XMSS/SHAKE_16_256 verify invalid signature ran 28 tests in 53.11 msec all ok
XMSS/SHAKE_16_512 verify invalid signature ran 28 tests in 118.49 msec all ok
XMSS/SHAKE_20_256 verify invalid signature ran 28 tests in 31.91 msec all ok
XMSS/SHAKE_20_512 verify invalid signature ran 28 tests in 129.48 msec all ok
Tests complete ran 2858301 tests in 18.89 sec 17713 tests failed

Full output: make-out.txt.gz

Branch master

Configuration

./configure.py --prefix=/opt/local --with-os-features=security_framework,apple_keychain,commoncrypto,getentropy --with-commoncrypto --with-openssl --with-boost --with-lzma --with-bzip2 --with-zlib --with-sqlite3 --with-python-version=3.9 --with-sphinx --with-pdf --system-cert-bundle=/opt/local/share/curl/curl-ca-bundle.crt --cc-abi-flags='-march=native -O3 -I/opt/local/include' 2>&1 | tee conf-out.txt

Output:

$ cat conf-out.txt 
   INFO: ./configure.py invoked with options "--prefix=/opt/local --with-os-features=security_framework,apple_keychain,commoncrypto,getentropy --with-commoncrypto --with-openssl --with-boost --with-lzma --with-bzip2 --with-zlib --with-sqlite3 --with-python-version=3.9 --with-sphinx --with-pdf --system-cert-bundle=/opt/local/share/curl/curl-ca-bundle.crt --cc-abi-flags=-march=native -O3 -I/opt/local/include"
   INFO: Configuring to build Botan 3.0.0-alpha0 (revision git:20e87b077c113744600510c431af1396663260a0)
   INFO: Running under 3.9.7 (default, Sep  1 2021, 12:35:15) [Clang 12.0.5 (clang-1205.0.22.9)]
   INFO: Implicit --cc-bin=clang++ due to environment variable CXX
   INFO: Implicit --cxxflags=-std=gnu++17 -O3 -march=native -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk due to environment variable CXXFLAGS
   INFO: Autodetected platform information: OS="Darwin" machine="x86_64" proc="i386"
   INFO: Guessing target OS is darwin (use --os to set)
   INFO: Guessing target processor is a x86_64 (use --cpu to set)
   INFO: Auto-detected compiler version 4.0
   INFO: Auto-detected compiler arch x86_64
   INFO: Target is clang:4.0-macos-x86_64
   INFO: Assuming target x86_64 is little endian
   INFO: Skipping (incompatible CPU): aes_armv8 aes_power8 sha1_armv8 sha2_32_armv8 shacal2_armv8 sm4_armv8
   INFO: Skipping (incompatible OS): certstor_system_windows win32_stats
   INFO: Skipping (requires external dependency): tpm
   INFO: Enabling use of external dependency boost
   INFO: Enabling use of external dependency bzip2
   INFO: Enabling use of external dependency commoncrypto
   INFO: Enabling use of external dependency lzma
   INFO: Enabling use of external dependency openssl
   INFO: Enabling use of external dependency sqlite3
   INFO: Enabling use of external dependency zlib
   INFO: Loading modules: adler32 aead aes aes_ni aes_vperm argon2 argon2fmt aria asio asn1 auto_rng base base32 base58 base64 bcrypt bcrypt_pbkdf bigint blake2 blake2mac block blowfish boost bzip2 camellia cascade cast128 cbc ccm cecpq1 certstor_flatfile certstor_sql certstor_sqlite3 certstor_system certstor_system_macos cfb chacha chacha20poly1305 chacha_avx2 chacha_rng chacha_simd32 checksum cmac comb4p commoncrypto compression cpuid crc24 crc32 cryptobox ctr curve25519 des dh dl_algo dl_group dlies dsa dyn_load eax ec_group ec_h2c ecc_key ecdh ecdsa ecgdsa ecies eckcdsa ed25519 elgamal eme_oaep eme_pkcs1 eme_raw emsa1 emsa_pkcs1 emsa_pssr emsa_raw emsa_x931 entropy fd_unix ffi filters fpe_fe1 gcm getentropy ghash ghash_cpu ghash_vperm gmac gost_28147 gost_3410 gost_3411 hash hash_id hex hkdf hmac hmac_drbg hotp http_util idea idea_sse2 iso9796 kdf kdf1 kdf1_iso18033 kdf2 keccak keypair lion locking_allocator lzma mac mce md4 md5 mdx_hash mem_pool mgf1 mode_pad modes mp newhope nist_keywrap noekeon noekeon_simd numbertheory ocb ofb openssl par_hash passhash9 pbes2 pbkdf pbkdf2 pem pgp_s2k pk_pad pkcs11 poly1305 poly_dbl prf_tls prf_x942 processor_rng psk_db pubkey rc4 rdseed rfc3394 rfc6979 rmd160 rng roughtime rsa salsa20 scrypt seed serpent serpent_avx2 serpent_simd sessions_sql sessions_sqlite3 sha1 sha1_sse2 sha1_x86 sha2_32 sha2_32_bmi2 sha2_32_x86 sha2_64 sha2_64_bmi2 sha3 sha3_bmi2 shacal2 shacal2_avx2 shacal2_simd shacal2_x86 shake shake_cipher simd simd_avx2 siphash siv skein sm2 sm3 sm4 socket sodium sp800_108 sp800_56a sp800_56c sqlite3 srp6 stateful_rng stream streebog system_rng thread_utils threefish_512 threefish_512_avx2 tls tls_cbc tss twofish utils uuid whirlpool x509 x919_mac xmss xts zlib
   INFO: Using symlink to link files into build dir (use --link-method to change)
   INFO: Botan 3.0.0-alpha0 (revision git:20e87b077c113744600510c431af1396663260a0) (unreleased undated) build setup is complete

Build output: make-out.txt.gz

Tests

.  .  .  .  .
c81ecfbd3776fbb75b2f211ddef474304929e0a2ef57121ba873a145e7cec15d3af0605f6e9cbc84ff70e4072f9e694557c302e2c2bb3db14bd52707b47890731e0cf6181d297d012967c3fd561f905b8a4ba23487 
xmss_verify_invalid:
XMSS/SHA2_10_256 verify invalid signature ran 28 tests in 34.19 msec all ok
XMSS/SHA2_10_512 verify invalid signature ran 28 tests in 69.04 msec all ok
XMSS/SHA2_16_256 verify invalid signature ran 28 tests in 34.38 msec all ok
XMSS/SHA2_16_512 verify invalid signature ran 28 tests in 94.06 msec all ok
XMSS/SHA2_20_256 verify invalid signature ran 28 tests in 40.87 msec all ok
XMSS/SHA2_20_512 verify invalid signature ran 28 tests in 79.71 msec all ok
XMSS/SHAKE_10_256 verify invalid signature ran 28 tests in 30.20 msec all ok
XMSS/SHAKE_10_512 verify invalid signature ran 28 tests in 83.43 msec all ok
XMSS/SHAKE_16_256 verify invalid signature ran 28 tests in 46.80 msec all ok
XMSS/SHAKE_16_512 verify invalid signature ran 28 tests in 130.14 msec all ok
XMSS/SHAKE_20_256 verify invalid signature ran 28 tests in 29.96 msec all ok
XMSS/SHAKE_20_512 verify invalid signature ran 28 tests in 121.47 msec all ok
Tests complete ran 2850756 tests in 18.28 sec 17713 tests failed (in entropy hash hash_rep mac newhope stream xmss_sign xmss_verify)

Full output: test-out.txt.gz

reneme commented 2 years ago

I can reproduce this on my Mac. Actually, it's not only the SHAKE tests that fail but also others:

SHAKE-128 ran 13740 tests in 34.75 msec 1145 FAILED
Keccak-1600(224) ran 2667 tests in 7.57 msec 1595 FAILED
Keccak-1600(256) ran 2667 tests in 6.96 msec 1595 FAILED
Keccak-1600(384) ran 2667 tests in 8.14 msec 1595 FAILED
Keccak-1600(512) ran 2667 tests in 9.15 msec 1595 FAILED
SHA-3(224) ran 1994 tests in 5.26 msec 1186 FAILED
SHA-3(256) ran 1994 tests in 5.37 msec 1186 FAILED
SHA-3(384) ran 1994 tests in 5.82 msec 1186 FAILED
SHA-3(512) ran 1994 tests in 8.73 msec 1186 FAILED
SHAKE-128(1120) ran 10 tests in 0.03 msec 6 FAILED
SHAKE-128(128) ran 2107 tests in 5.04 msec 1259 FAILED
SHAKE-256(2000) ran 10 tests in 0.03 msec 6 FAILED
SHAKE-256(256) ran 27 tests in 0.10 msec 15 FAILED
HMAC(SHA-3(224)) ran 88 tests in 0.44 msec 32 FAILED
HMAC(SHA-3(256)) ran 88 tests in 0.40 msec 32 FAILED
HMAC(SHA-3(384)) ran 88 tests in 0.42 msec 32 FAILED
HMAC(SHA-3(512)) ran 88 tests in 0.43 msec 32 FAILED
Long input SHA-3(224) ran 1 tests in 7.35 msec 1 FAILED
Long input SHA-3(256) ran 1 tests in 7.74 msec 1 FAILED
Long input SHA-3(384) ran 1 tests in 10.92 msec 1 FAILED
Long input SHA-3(512) ran 1 tests in 15.17 msec 1 FAILED
NEWHOPE ran 4000 tests in 342.16 msec 4000 FAILED
XMSS/SHAKE_10_256 signature generation ran 27 tests in 1.83 sec 6 FAILED
XMSS/SHAKE_10_256 signature verification ran 21 tests in 73.44 msec 3 FAILED
XMSS/SHAKE_10_512 signature verification ran 21 tests in 265.49 msec 3 FAILED
XMSS/SHAKE_16_256 signature verification ran 21 tests in 60.77 msec 3 FAILED
XMSS/SHAKE_16_512 signature verification ran 21 tests in 238.57 msec 3 FAILED
XMSS/SHAKE_20_256 signature verification ran 21 tests in 30.30 msec 3 FAILED
XMSS/SHAKE_20_512 signature verification ran 21 tests in 97.03 msec 3 FAILED

reneme commented 2 years ago

Same output for the current master, by the way.
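For anyone comparing runs across branches or compilers, a failure list like the one above can be pulled out of the full `botan-test` output automatically. The following is only a minimal sketch: the regular expression is inferred from the result lines quoted in this thread, and the helper name `failing_suites` is made up for illustration.

```python
import re

# Matches per-suite summary lines as they appear in this thread, e.g.
#   "SHA-3(224) ran 1994 tests in 5.26 msec 1186 FAILED"
#   "SHA-3(224) ran 1994 tests in 28.48 msec all ok"
# The pattern is an assumption based on those samples only.
RESULT_LINE = re.compile(
    r"^(?P<name>.+?) ran (?P<ran>\d+) tests in [\d.]+ (?:msec|sec) "
    r"(?:(?P<failed>\d+) FAILED|all ok)$"
)


def failing_suites(log_text: str) -> dict:
    """Map each failing suite name to a (tests_failed, tests_ran) tuple."""
    failures = {}
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        # Lines ending in "all ok" match but carry no "failed" group.
        if match and match.group("failed") is not None:
            failures[match.group("name")] = (
                int(match.group("failed")),
                int(match.group("ran")),
            )
    return failures


if __name__ == "__main__":
    sample = "\n".join([
        "SHA-3(224) ran 1994 tests in 5.26 msec 1186 FAILED",
        "NEWHOPE ran 4000 tests in 342.16 msec 4000 FAILED",
        "XMSS/SHAKE_10_256 signature generation ran 27 tests in 1.83 sec 6 FAILED",
        "XMSS/SHA2_10_256 verify invalid signature ran 28 tests in 34.19 msec all ok",
    ])
    for name, (failed, ran) in failing_suites(sample).items():
        print(f"{name}: {failed}/{ran} failed")
```

Feeding it the saved test output from two builds and diffing the resulting dictionaries shows quickly whether the same suites fail in both.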

reneme commented 2 years ago

If I compile in debug mode (without optimizations), the SHA-3 tests work (with the BMI2 module disabled). Results for other tests pending...

./configure.py --minimized-build --enable-modules='sha3' --build-targets='static,tests' --debug-mode
./botan-test hash
Testing Botan 3.0.0-alpha0 (unreleased, revision git:d17ffc7438d467cf311ce4ad0a6d9889f2e31f9a, distribution unspecified)
CPU flags: sse2 ssse3 sse41 sse42 avx2 rdtsc bmi1 bmi2 adx aes_ni clmul rdrand rdseed
Starting tests drbg_seed:00004FC5F5942DB3
hash:
SHA-3(224) ran 1994 tests in 28.48 msec all ok
SHA-3(256) ran 1994 tests in 28.57 msec all ok
SHA-3(384) ran 1994 tests in 26.53 msec all ok
SHA-3(512) ran 1994 tests in 58.52 msec all ok
Tests complete ran 7976 tests in 590.43 msec all tests ok

Edit: the full test suite (on master) works fine in debug mode.

[...]
Tests complete ran 2675969 tests in 74.42 sec all tests ok

Edit2: without optimizations (./configure.py --minimized-build --enable-modules='sha3' --build-targets='static,tests' --no-optimizations) tests run fine as well.

Edit3: -O1 produces working tests; -O2 and -O3 both make the tests fail. :-( Below is an example ./configure.py call to enable -O1:

./configure.py --minimized-build --enable-modules='sha3' --build-targets='static,tests' --no-optimizations --extra-cxxflags='-O1'

mouse07410 commented 2 years ago

I concur.

What would happen if you omit the --no-optimizations flag, but provide --extra-cxxflags='-O1'?

In my case, where I have a "standard" CXXFLAGS env var to control all of my compiles on the system, I'll probably need something like --cxxflags='XXXX'? I don't want to interfere with flags that the configurator currently sets and that the build needs in order to work. On the other hand, there are some options that must be passed (like -I/opt/local/include for Boost, -isysroot ....... for the compiler to find system header files that Apple now sticks in , etc.). I usually pass them via CXXFLAGS for all of my projects; should I do so here as well?

It was my understanding that -O1 performs some optimization (as evidenced by the resulting binary size?), and --no-optimizations should be -O0?

From the Clang docs:

Code Generation Options

-O0, -O1, -O2, -O3, -Ofast, -Os, -Oz, -Og, -O, -O4
Specify which optimization level to use:

-O0 Means “no optimization”: this level compiles the fastest and generates the most debuggable code.

-O1 Somewhere between -O0 and -O2.

-O2 Moderate level of optimization which enables most optimizations.

-O3 Like -O2, except that it enables optimizations that take longer to perform or that may generate larger code (in an attempt to make the program run faster).

-Ofast Enables all the optimizations from -O3 along with other aggressive optimizations that may violate strict compliance with language standards.

-Os Like -O2 with extra optimizations to reduce code size.

-Oz Like -Os (and thus -O2), but reduces code size further.

-Og Like -O1. In future versions, this option might disable different optimizations in order to improve debuggability.

-O Equivalent to -O1.

-O4 and higher

Currently equivalent to -O3

mouse07410 commented 2 years ago

I'm getting the same results (enabling optimization -O2 or higher kills the tests) with Clang-12 from MacPorts.

So, it's probably not a bug introduced by Xcode-13. The SHA-3 code is the likelier culprit. I suspect we should try -fsanitize=undefined?
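When it is unclear which build produces the wrong digests, an implementation outside Botan can serve as a reference. As a minimal sketch (not part of Botan's test suite): Python's standard hashlib has its own SHA-3 and SHAKE code, so its output can be compared against a suspect build. The helper name `reference_digests` and the Botan-style labels used as dictionary keys are purely illustrative.

```python
import hashlib


def reference_digests(message: bytes) -> dict:
    """Return hex digests of `message` computed by Python's hashlib.

    The keys imitate the algorithm names seen in the test output in this
    thread; they are labels only and are not passed to any library.
    """
    return {
        "SHA-3(224)": hashlib.sha3_224(message).hexdigest(),
        "SHA-3(256)": hashlib.sha3_256(message).hexdigest(),
        "SHA-3(384)": hashlib.sha3_384(message).hexdigest(),
        "SHA-3(512)": hashlib.sha3_512(message).hexdigest(),
        # SHAKE is an extendable-output function, so hexdigest() needs an
        # explicit output length in bytes (16 bytes = 128 bits, and so on).
        "SHAKE-128(128)": hashlib.shake_128(message).hexdigest(16),
        "SHAKE-256(256)": hashlib.shake_256(message).hexdigest(32),
    }


if __name__ == "__main__":
    for name, digest in reference_digests(b"").items():
        print(f"{name}: {digest}")
```

If an -O1 build agrees with these values and an -O2 build does not, that points at miscompiled or undefined-behavior-dependent code rather than at the test vectors.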

reneme commented 2 years ago

Dang! If I build with UBSan, the tests work okay:

./configure.py --minimized-build --enable-modules='sha3' --build-targets='static,tests' --extra-cxxflags='-fsanitize=undefined'  --ldflags='-fsanitize=undefined'
./botan-test hash
Testing Botan 3.0.0-alpha0 (unreleased, revision git:d17ffc7438d467cf311ce4ad0a6d9889f2e31f9a, distribution unspecified)
CPU flags: sse2 ssse3 sse41 sse42 avx2 rdtsc bmi1 bmi2 adx aes_ni clmul rdrand rdseed
Starting tests drbg_seed:0000668EDE80C905
hash:
SHA-3(224) ran 1994 tests in 8.36 msec all ok
SHA-3(256) ran 1994 tests in 8.09 msec all ok
SHA-3(384) ran 1994 tests in 8.09 msec all ok
SHA-3(512) ran 1994 tests in 16.73 msec all ok
Tests complete ran 7976 tests in 183.44 msec all tests ok

mouse07410 commented 2 years ago

Interesting. And there was no report from UBSAN, as far as I could see...

Although I had to build with Macports Clang-12, because Xcode Clang failed to link with sanitizer (?!)...

Testing Botan 3.0.0-alpha0 (unreleased, revision git:8ee0c78c3f7561a2353bc69ac6b6baf27dbaa60e, distribution unspecified)
CPU flags: sse2 ssse3 sse41 sse42 avx2 avx512f avx512dq avx512bw rdtsc bmi1 bmi2 adx aes_ni clmul rdrand rdseed

reneme commented 2 years ago

Did you link with -fsanitize=undefined? Without that it didn't link for me either. Same ./configure.py as above, but with line breaks:

./configure.py                              \
    --minimized-build                       \
    --enable-modules='sha3'                 \
    --build-targets='static,tests'          \
    --extra-cxxflags='-fsanitize=undefined' \
    --ldflags='-fsanitize=undefined'

mouse07410 commented 2 years ago

Did you link with -fsanitize=undefined?

Ah, sh**. Right. No, I did not. :-(

Rebuilding...

mouse07410 commented 2 years ago

OK, I've rebuilt as you showed (and with optimization!), with -fsanitize=address added as well, and immediately got UBSan and ASan errors:

bcrypt_pbkdf:
src/tests/unit_ecdsa.cpp:307:52: runtime error: load of value 99, which is not a valid value for type 'Botan::PointGFp::Compression_Type'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/tests/unit_ecdsa.cpp:307:52 in 
src/lib/pubkey/ecc_key/ecc_key.cpp:82:7: runtime error: load of value 99, which is not a valid value for type 'PointGFp::Compression_Type'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/lib/pubkey/ecc_key/ecc_key.cpp:82:7 in 
src/lib/pubkey/ecc_key/ecc_key.cpp:83:7: runtime error: load of value 99, which is not a valid value for type 'PointGFp::Compression_Type'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/lib/pubkey/ecc_key/ecc_key.cpp:83:7 in 
src/lib/pubkey/ecc_key/ecc_key.cpp:84:7: runtime error: load of value 99, which is not a valid value for type 'PointGFp::Compression_Type'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/lib/pubkey/ecc_key/ecc_key.cpp:84:7 in 
/opt/local/libexec/llvm-11/bin/../include/c++/v1/functional:1884:16: runtime error: member call on address 0x0001032f5ba0 which does not point to an object of type 'std::__1::__function::__base<int ()>'
0x0001032f5ba0: note: object has invalid vptr
 00 00 00 00  75 73 65 72 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              invalid vptr
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /opt/local/libexec/llvm-11/bin/../include/c++/v1/functional:1884:16 in 
=================================================================
==48807==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0001032f5ba0 at pc 0x0001051a0291 bp 0x700001b49230 sp 0x700001b49228
READ of size 8 at 0x0001032f5ba0 thread T12
    #0 0x1051a0290 in std::__1::__function::__value_func<int ()>::operator()() const functional:1884
    #1 0x10519d62c in Botan_FFI::ffi_guard_thunk(char const*, std::__1::function<int ()>) ffi.cpp:99
    #2 0x10534da40 in botan_rng_init ffi_rng.cpp:26
    #3 0x1022593e1 in Botan_Tests::(anonymous namespace)::FFI_Unit_Tests::run() test_ffi.cpp:46
    #4 0x1025ef0cc in Botan_Tests::(anonymous namespace)::run_a_test(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) test_runner.cpp:310
    #5 0x102608d54 in decltype(std::__1::forward<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&>(fp)(std::__1::forward<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&>(fp0))) std::__1::__invoke<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&>(Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) type_traits:3899
    #6 0x102608cf5 in std::__1::__bind_return<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1, std::__1::tuple<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::tuple<>, __is_valid_bind_return<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1, std::__1::tuple<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::tuple<> >::value>::type std::__1::__apply_functor<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1, std::__1::tuple<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, 0ul, std::__1::tuple<> >(Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::tuple<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >&, std::__1::__tuple_indices<0ul>, std::__1::tuple<>&&) functional:2853
    #7 0x102608c63 in decltype(std::__1::forward<std::__1::__bind<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>&>(fp)()) std::__1::__invoke<std::__1::__bind<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>&>(std::__1::__bind<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>&) type_traits:3899
    #8 0x102608a16 in std::__1::__packaged_task_func<std::__1::__bind<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>, std::__1::allocator<std::__1::__bind<Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long)::$_1&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&> >, std::__1::vector<Botan_Tests::Test::Result, std::__1::allocator<Botan_Tests::Test::Result> > ()>::operator()() future:1817
    #9 0x102603ced in std::__1::__packaged_task_function<std::__1::vector<Botan_Tests::Test::Result, std::__1::allocator<Botan_Tests::Test::Result> > ()>::operator()() const future:1994
    #10 0x102603866 in std::__1::packaged_task<std::__1::vector<Botan_Tests::Test::Result, std::__1::allocator<Botan_Tests::Test::Result> > ()>::operator()() future:2085
    #11 0x105e1cfaf in Botan::Thread_Pool::worker_thread() thread_pool.cpp:133
    #12 0x105e21ed9 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (Botan::Thread_Pool::*)(), Botan::Thread_Pool*> >(void*) thread:291
    #13 0x7fff204238fb in _pthread_start+0xdf (libsystem_pthread.dylib:x86_64+0x68fb)
    #14 0x7fff2041f442 in thread_start+0xe (libsystem_pthread.dylib:x86_64+0x2442)

0x0001032f5ba5 is located 0 bytes to the right of global variable '<string literal>' defined in 'src/tests/test_ffi.cpp:46:14' (0x1032f5ba0) of size 5
  '<string literal>' is ascii string 'user'
SUMMARY: AddressSanitizer: global-buffer-overflow functional:1884 in std::__1::__function::__value_func<int ()>::operator()() const
Shadow bytes around the buggy address:
  0x10002065eb20: 00 00 04 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
  0x10002065eb30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10002065eb40: 00 00 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
  0x10002065eb50: 00 00 00 00 f9 f9 f9 f9 00 00 00 00 00 06 f9 f9
  0x10002065eb60: f9 f9 f9 f9 04 f9 f9 f9 f9 f9 f9 f9 00 07 f9 f9
=>0x10002065eb70: f9 f9 f9 f9[05]f9 f9 f9 f9 f9 f9 f9 00 00 02 f9
  0x10002065eb80: f9 f9 f9 f9 02 f9 f9 f9 f9 f9 f9 f9 00 00 00 00
  0x10002065eb90: 06 f9 f9 f9 f9 f9 f9 f9 00 f9 f9 f9 f9 f9 f9 f9
  0x10002065eba0: 00 00 01 f9 f9 f9 f9 f9 00 05 f9 f9 f9 f9 f9 f9
  0x10002065ebb0: 07 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 02 f9 f9 f9
  0x10002065ebc0: f9 f9 f9 f9 05 f9 f9 f9 f9 f9 f9 f9 06 f9 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
Thread T12 created by T0 here:
    #0 0x108759c9a in wrap_pthread_create+0x5a (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x41c9a)
    #1 0x105e21b66 in std::__1::thread::thread<void (Botan::Thread_Pool::*)(), Botan::Thread_Pool*, void>(void (Botan::Thread_Pool::*&&)(), Botan::Thread_Pool*&&) thread:307
    #2 0x105e1c431 in Botan::Thread_Pool::Thread_Pool(std::__1::optional<unsigned long>) thread_pool.cpp:71
    #3 0x1025eb600 in Botan_Tests::Test_Runner::run_tests(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, unsigned long, unsigned long, unsigned long) test_runner.cpp:358
    #4 0x1025e8f35 in Botan_Tests::Test_Runner::run(Botan_Tests::Test_Options const&) test_runner.cpp:188
    #5 0x102026789 in main main.cpp:112
    #6 0x7fff2043ef3c in start+0x0 (libdyld.dylib:x86_64+0x15f3c)

==48807==ABORTING

Here's my config:

$ cat conf-out.txt 
   INFO: ./configure.py invoked with options "--prefix=/opt/local --with-os-features=security_framework,apple_keychain,commoncrypto,getentropy --with-commoncrypto --with-openssl --with-boost --with-lzma --with-bzip2 --with-zlib --with-sqlite3 --with-python-version=3.9 --with-sphinx --with-pdf --system-cert-bundle=/opt/local/share/curl/curl-ca-bundle.crt --cc-abi-flags=-march=native --no-optimizations --extra-cxxflags=-g -O3 -I/opt/local/include -fsanitize=undefined -fsanitize=address --ldflags=-fsanitize=undefined -fsanitize=address"
   INFO: Configuring to build Botan 3.0.0-alpha0 (revision git:8ee0c78c3f7561a2353bc69ac6b6baf27dbaa60e)
   INFO: Running under 3.9.7 (default, Sep  1 2021, 12:35:15) [Clang 12.0.5 (clang-1205.0.22.9)]
   INFO: Implicit --cc-bin=clang++ due to environment variable CXX
   INFO: Implicit --cxxflags= due to environment variable CXXFLAGS
   INFO: Autodetected platform information: OS="Darwin" machine="x86_64" proc="i386"
   INFO: Guessing target OS is darwin (use --os to set)
   INFO: Guessing target processor is a x86_64 (use --cpu to set)
   INFO: Auto-detected compiler version 11.1
   INFO: Auto-detected compiler arch x86_64
   INFO: Target is clang:11.1-macos-x86_64
   INFO: Assuming target x86_64 is little endian
   INFO: Skipping (incompatible CPU): aes_armv8 aes_power8 sha1_armv8 sha2_32_armv8 shacal2_armv8 sm4_armv8
   INFO: Skipping (incompatible OS): certstor_system_windows win32_stats
   INFO: Skipping (requires external dependency): tpm
   INFO: Enabling use of external dependency boost
   INFO: Enabling use of external dependency bzip2
   INFO: Enabling use of external dependency commoncrypto
   INFO: Enabling use of external dependency lzma
   INFO: Enabling use of external dependency openssl
   INFO: Enabling use of external dependency sqlite3
   INFO: Enabling use of external dependency zlib
   INFO: Loading modules: adler32 aead aes aes_ni aes_vperm argon2 argon2fmt aria asio asn1 auto_rng base base32 base58 base64 bcrypt bcrypt_pbkdf bigint blake2 blake2mac block blowfish boost bzip2 camellia cascade cast128 cbc ccm cecpq1 certstor_flatfile certstor_sql certstor_sqlite3 certstor_system certstor_system_macos cfb chacha chacha20poly1305 chacha_avx2 chacha_rng chacha_simd32 checksum cmac comb4p commoncrypto compression cpuid crc24 crc32 cryptobox ctr curve25519 des dh dl_algo dl_group dlies dsa dyn_load eax ec_group ec_h2c ecc_key ecdh ecdsa ecgdsa ecies eckcdsa ed25519 elgamal eme_oaep eme_pkcs1 eme_raw emsa1 emsa_pkcs1 emsa_pssr emsa_raw emsa_x931 entropy fd_unix ffi filters fpe_fe1 gcm getentropy ghash ghash_cpu ghash_vperm gmac gost_28147 gost_3410 gost_3411 hash hash_id hex hkdf hmac hmac_drbg hotp http_util idea idea_sse2 iso9796 kdf kdf1 kdf1_iso18033 kdf2 keccak keypair lion locking_allocator lzma mac mce md4 md5 mdx_hash mem_pool mgf1 mode_pad modes mp newhope nist_keywrap noekeon noekeon_simd numbertheory ocb ofb openssl par_hash passhash9 pbes2 pbkdf pbkdf2 pem pgp_s2k pk_pad pkcs11 poly1305 poly_dbl prf_tls prf_x942 processor_rng psk_db pubkey rc4 rdseed rfc3394 rfc6979 rmd160 rng roughtime rsa salsa20 scrypt seed serpent serpent_avx2 serpent_simd sessions_sql sessions_sqlite3 sha1 sha1_sse2 sha1_x86 sha2_32 sha2_32_bmi2 sha2_32_x86 sha2_64 sha2_64_bmi2 sha3 sha3_bmi2 shacal2 shacal2_avx2 shacal2_simd shacal2_x86 shake shake_cipher simd simd_avx2 siphash siv skein sm2 sm3 sm4 socket sodium sp800_108 sp800_56a sp800_56c sqlite3 srp6 stateful_rng stream streebog system_rng thread_utils threefish_512 threefish_512_avx2 tls tls_cbc tss twofish utils uuid whirlpool x509 x919_mac xmss xts zlib
   INFO: Using symlink to link files into build dir (use --link-method to change)
   INFO: Botan 3.0.0-alpha0 (revision git:8ee0c78c3f7561a2353bc69ac6b6baf27dbaa60e) (unreleased undated) build setup is complete

Here's build: make-out.txt

And the tests output (brief now ;): test-out.txt

reneme commented 2 years ago

Alright, I think I'm digging much deeper than I should. Nevertheless, some printf-debugging suggests that the optimization breaks either the for-loop in SHA_3::permute() or SHA3_round() in sha3.cpp.

I instrumented said for-loop like so:

// helper function
void print_array(uint64_t S[25])
  {
  for (unsigned int i = 0; i < 25; ++i)
    {
    std::cout << i << ": " <<  S[i] << '\n';
    }
  std::cout << '\n';
  }

// loop in SHA_3::permute()
for(size_t i = 0; i != 24; i += 2)
  {
  SHA3_round(T, A, RC[i+0]);

  std::cout << i << '\n';
  print_array(T);

  SHA3_round(A, T, RC[i+1]);

  std::cout << i << '\n';
  print_array(A);
  }

Curiously, the first invocation of SHA3_round() (in the first for-loop iteration) produces a consistent output both with -O3 and -O0:

-O3                                 -O0

0
0: 15515230172486                   0: 15515230172486
1: 9751542238472685244              1: 9751542238472685244
2: 220181482233372672               2: 220181482233372672
3: 2303197730119                    3: 2303197730119
4: 9537012007446913720              4: 9537012007446913720
5: 0                                5: 0
6: 14782389640143539577             6: 14782389640143539577
7: 2305843009213693952              7: 2305843009213693952
8: 1056340403235818873              8: 1056340403235818873
9: 16396894922196123648             9: 16396894922196123648
10: 13438274300558                  10: 13438274300558
11: 3440198220943040                11: 3440198220943040
12: 0                               12: 0
13: 3435902021559310                13: 3435902021559310
14: 64                              14: 64
15: 14313837075027532897            15: 14313837075027532897
16: 32768                           16: 32768
17: 6880396441885696                17: 6880396441885696
18: 14320469711924527201            18: 14320469711924527201
19: 0                               19: 0
20: 9814829303127743595             20: 9814829303127743595
21: 18014398509481984               21: 18014398509481984
22: 14444556046857390455            22: 14444556046857390455
23: 4611686018427387904             23: 4611686018427387904
24: 18041275058083100               24: 18041275058083100

While the second invocation of SHA3_round() (still in the first for-loop iteration) diverges with optimizations turned on:

-O3                                 -O0

0: 17575451156447854761             0: 16394434931424703552
1: 14098857779569698676             1: 10202638136074191489
2: 15203366283230579135             2: 6432602484395933614
3: 7281964473540433968              3: 10616058301262943899
4: 15263754171087725068             4: 14391824303596635982
5: 5633660519975424370              5: 5673590995284149638
6: 9460528598356748811              6: 15681872423764765508
7: 11341541901141834517             7: 11470206704342013341
8: 13617408866846456712             8: 8508807405493883168
9: 9482641616152924955              9: 9461805213344568570
10: 4601466176962577084             10: 8792313850970105187
11: 10116897752656313514            11: 13508586629627657374
12: 2660843814697583169             12: 5157283382205130943
13: 4401623780961015245             13: 375019647457809685
14: 9331863398332548709             14: 9294608398083155963
15: 13375032266224287796            15: 16923121173371064314
16: 17332059665275230084            16: 4737739424553008030
17: 14755860832370067683            17: 5823987023293412593
18: 14224301065282086166            18: 13908063749137376267
19: 9839923188256068435             19: 13781177305593198238
20: 13205233061921415593            20: 9673833001659673401
21: 18208436141457192220            21: 17282395057630454440
22: 1926250569723249352             22: 12906624984756985556
23: 2939545859708404818             23: 3081478361927354234
24: 7729773820470333198             24: 93297594635310132

Now, frankly, I find this rather strange. SHA3_round() does a whole lot of bitwise operations, but has no branches or anything. My gut feeling is that it should produce consistent outputs for both the first and the second invocation. So right now, my working hypothesis is that the compiler somehow screws up the inlining of SHA3_round() into the for loop.

reneme commented 2 years ago

Turns out: After moving SHA3_round() into another compilation unit (avoiding the inline) the problem persists. But if I exclusively rebuild the new compilation unit (containing SHA3_round()) and relink the rest, the result is correct. So the optimizer seems to trip over the bitshift stuff in SHA3_round() after all.

reneme commented 2 years ago

Okay, this is becoming a goose chase, I'm afraid. I compared the input values in A for the above-mentioned second invocation of SHA3_round() in the first iteration of the for loop in SHA_3::permute(). The 25 array values are consistent across -O0 and -O3. However, the first value to deviate is C2, the XOR of A[2], A[7], A[12], A[17] and A[22] in SHA3_round(). Again: the input values in A are the same for -O0 and -O3; I compared them time and again. Yet C2 is 16961422039339595127 for -O0 and 2528292126282103808 for -O3 (and -O2, FWIW).

Enough for today...

reneme commented 2 years ago

Here's a minimal example (that doesn't depend on Botan), reproducing the discrepancy.

With -O1 (and apple clang 13) it works fine and the assertions at the end check out; for -O2 and above they don't. 🤡

#include <cstdint>
#include <cassert>

template<size_t ROT, typename T>
inline constexpr T rotl(T input)
   {
   static_assert(ROT > 0 && ROT < 8*sizeof(T), "Invalid rotation constant");
   return static_cast<T>((input << ROT) | (input >> (8*sizeof(T) - ROT)));
   }

inline void SHA3_round(uint64_t T[25], const uint64_t A[25], uint64_t RC)
   {
   const uint64_t C0 = A[0] ^ A[5] ^ A[10] ^ A[15] ^ A[20];
   const uint64_t C1 = A[1] ^ A[6] ^ A[11] ^ A[16] ^ A[21];

   // the calculation of C2 fails for -O3 or -O2 with clang 12
   // FWIW: it would produce a value that doesn't fit into a _signed_ 64-bit int
   const uint64_t C2 = A[2] ^ A[7] ^ A[12] ^ A[17] ^ A[22];

   const uint64_t C3 = A[3] ^ A[8] ^ A[13] ^ A[18] ^ A[23];
   const uint64_t C4 = A[4] ^ A[9] ^ A[14] ^ A[19] ^ A[24];

   const uint64_t D0 = rotl<1>(C0) ^ C3;
   const uint64_t D1 = rotl<1>(C1) ^ C4;
   const uint64_t D2 = rotl<1>(C2) ^ C0;
   const uint64_t D3 = rotl<1>(C3) ^ C1;
   const uint64_t D4 = rotl<1>(C4) ^ C2;

   const uint64_t B00 =          A[ 0] ^ D1;
   const uint64_t B01 = rotl<44>(A[ 6] ^ D2);
   const uint64_t B02 = rotl<43>(A[12] ^ D3);
   const uint64_t B03 = rotl<21>(A[18] ^ D4);
   const uint64_t B04 = rotl<14>(A[24] ^ D0);
   T[ 0] = B00 ^ (~B01 & B02) ^ RC;
   T[ 1] = B01 ^ (~B02 & B03);
   T[ 2] = B02 ^ (~B03 & B04);
   T[ 3] = B03 ^ (~B04 & B00);
   T[ 4] = B04 ^ (~B00 & B01);

   const uint64_t B05 = rotl<28>(A[ 3] ^ D4);
   const uint64_t B06 = rotl<20>(A[ 9] ^ D0);
   const uint64_t B07 = rotl< 3>(A[10] ^ D1);
   const uint64_t B08 = rotl<45>(A[16] ^ D2);
   const uint64_t B09 = rotl<61>(A[22] ^ D3);
   T[ 5] = B05 ^ (~B06 & B07);
   T[ 6] = B06 ^ (~B07 & B08);
   T[ 7] = B07 ^ (~B08 & B09);
   T[ 8] = B08 ^ (~B09 & B05);
   T[ 9] = B09 ^ (~B05 & B06);

   // --- instructions starting from here can be removed
   //     and the -O3 discrepancy is still triggered

   const uint64_t B10 = rotl< 1>(A[ 1] ^ D2);
   const uint64_t B11 = rotl< 6>(A[ 7] ^ D3);
   const uint64_t B12 = rotl<25>(A[13] ^ D4);
   const uint64_t B13 = rotl< 8>(A[19] ^ D0);
   const uint64_t B14 = rotl<18>(A[20] ^ D1);
   T[10] = B10 ^ (~B11 & B12);
   T[11] = B11 ^ (~B12 & B13);
   T[12] = B12 ^ (~B13 & B14);
   T[13] = B13 ^ (~B14 & B10);
   T[14] = B14 ^ (~B10 & B11);

   const uint64_t B15 = rotl<27>(A[ 4] ^ D0);
   const uint64_t B16 = rotl<36>(A[ 5] ^ D1);
   const uint64_t B17 = rotl<10>(A[11] ^ D2);
   const uint64_t B18 = rotl<15>(A[17] ^ D3);
   const uint64_t B19 = rotl<56>(A[23] ^ D4);
   T[15] = B15 ^ (~B16 & B17);
   T[16] = B16 ^ (~B17 & B18);
   T[17] = B17 ^ (~B18 & B19);
   T[18] = B18 ^ (~B19 & B15);
   T[19] = B19 ^ (~B15 & B16);

   const uint64_t B20 = rotl<62>(A[ 2] ^ D3);
   const uint64_t B21 = rotl<55>(A[ 8] ^ D4);
   const uint64_t B22 = rotl<39>(A[14] ^ D0);
   const uint64_t B23 = rotl<41>(A[15] ^ D1);
   const uint64_t B24 = rotl< 2>(A[21] ^ D2);
   T[20] = B20 ^ (~B21 & B22);
   T[21] = B21 ^ (~B22 & B23);
   T[22] = B22 ^ (~B23 & B24);
   T[23] = B23 ^ (~B24 & B20);
   T[24] = B24 ^ (~B20 & B21);
   }

int main()
{
    uint64_t T[25];

    uint64_t A[25] = {
        15515230172486u, 9751542238472685244u, 220181482233372672u,
        2303197730119u, 9537012007446913720u, 0u, 14782389640143539577u,
        2305843009213693952u, 1056340403235818873u, 16396894922196123648u,
        13438274300558u, 3440198220943040u, 0u, 3435902021559310u, 64u,
        14313837075027532897u, 32768u, 6880396441885696u, 14320469711924527201u,
        0u, 9814829303127743595u, 18014398509481984u, 14444556046857390455u,
        4611686018427387904u, 18041275058083100u };

    SHA3_round(T, A, 0x0000000000008082);

    assert(T[0]  == 16394434931424703552u);
    assert(T[1]  == 10202638136074191489u);
    assert(T[2]  == 6432602484395933614u);
    assert(T[3]  == 10616058301262943899u);
    assert(T[4]  == 14391824303596635982u);
    assert(T[5]  == 5673590995284149638u);
    assert(T[6]  == 15681872423764765508u);
    assert(T[7]  == 11470206704342013341u);
    assert(T[8]  == 8508807405493883168u);
    assert(T[9]  == 9461805213344568570u);
    assert(T[10] == 8792313850970105187u);
    assert(T[11] == 13508586629627657374u);
    assert(T[12] == 5157283382205130943u);
    assert(T[13] == 375019647457809685u);
    assert(T[14] == 9294608398083155963u);
    assert(T[15] == 16923121173371064314u);
    assert(T[16] == 4737739424553008030u);
    assert(T[17] == 5823987023293412593u);
    assert(T[18] == 13908063749137376267u);
    assert(T[19] == 13781177305593198238u);
    assert(T[20] == 9673833001659673401u);
    assert(T[21] == 17282395057630454440u);
    assert(T[22] == 12906624984756985556u);
    assert(T[23] == 3081478361927354234u);
    assert(T[24] == 93297594635310132u);

    return 0;
}

mouse07410 commented 2 years ago

With -O1 (and apple clang 13) it works fine and the assertions at the end check out; for -O2 and above they don't

Great investigation!

Also, note that I'm getting the same results with other Clang-12 compilers, not only Xcode.

But Clang-11 from Macports seems to compile and run your reproducer correctly.

hrantzsch commented 2 years ago

FWIW I cannot reproduce the issue with Clang 12 on Linux.

reneme commented 2 years ago

Tests complete ran 2675970 tests in 8.54 sec all tests ok

🎉🎉

reneme commented 2 years ago

For reference, bug report to llvm.org: https://bugs.llvm.org/show_bug.cgi?id=51957

mouse07410 commented 2 years ago

I confirm that it works on MacOS-11.6 with Xcode-13 and -Ofast -Os optimizations turned on.

Thank you! Let's merge, and back-port to release-2, as it has the same problem.

DimitryAndric commented 2 years ago

Here's a minimal example (that doesn't depend on Botan), reproducing the discrepancy.

With -O1 (and apple clang 13) it works fine and the assertions at the end check out; for -O2 and above they don't. 🤡

Interesting, I cannot reproduce this with Apple clang version 13.0.0 (clang-1300.0.29.3), nor with FreeBSD clang 13.0.0-rc3-8-g08642a395f23. I.e., none of the assertions trigger. (Note that Apple's versions don't exactly correspond to upstream LLVM versions.)

Do you know of any non-Mac clang version that miscompiles the test case, and what exact compile flags are being used?

mouse07410 commented 2 years ago

I cannot reproduce this with . . .

I suspect it's the combination of the "right" compiler and CPU (see below).

I know that at least two independent developers here reproduced the problem, with Xcode-13 (Apple clang version 13.0.0 (clang-1300.0.29.3)), and with Macports clang version 12.0.1 (Macports clang version 11.1.0 is "immune", as were previous releases).

I also know that LLVM developers confirmed the issue, and bisected it to the SLP Vectorizer.

@reneme's reproducer "reproduces" the problem with practically no flags at all:

$ clang++ -v
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -std=gnu++17 -o s -O2 sha3-reproducer.cxx 
$ ./s
Assertion failed: (T[0] == 16394434931424703552u), function main, file sha3-reproducer.cxx, line 103.
Abort trap: 6
$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -std=gnu++14 -o s -O2 sha3-reproducer.cxx 
$ ./s
Assertion failed: (T[0] == 16394434931424703552u), function main, file sha3-reproducer.cxx, line 103.
Abort trap: 6
$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -std=gnu++11 -o s -O2 sha3-reproducer.cxx 
$ ./s
Assertion failed: (T[0] == 16394434931424703552u), function main, file sha3-reproducer.cxx, line 103.
Abort trap: 6
$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -std=c++11 -o s -O2 sha3-reproducer.cxx 
$ ./s
Assertion failed: (T[0] == 16394434931424703552u), function main, file sha3-reproducer.cxx, line 103.
Abort trap: 6
$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -std=gnu++17 -o s -O1 sha3-reproducer.cxx 
$ ./s
$ 

Note: without -std=... it fails to compile:

$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -o s -O1 sha3-reproducer.cxx 
sha3-reproducer.cxx:5:8: error: unknown type name 'constexpr'
inline constexpr T rotl(T input)
       ^
sha3-reproducer.cxx:5:18: warning: variable templates are a C++14 extension [-Wc++14-extensions]
inline constexpr T rotl(T input)
                 ^
sha3-reproducer.cxx:5:1: warning: inline variables are a C++17 extension [-Wc++17-extensions]
inline constexpr T rotl(T input)
^
sha3-reproducer.cxx:5:19: error: expected ';' at end of declaration
inline constexpr T rotl(T input)
                  ^
                  ;
sha3-reproducer.cxx:5:25: error: unknown type name 'T'
inline constexpr T rotl(T input)
                        ^
sha3-reproducer.cxx:5:20: error: C++ requires a type specifier for all declarations
inline constexpr T rotl(T input)
                   ^
sha3-reproducer.cxx:7:29: error: use of undeclared identifier 'ROT'
   static_assert(ROT > 0 && ROT < 8*sizeof(T), "Invalid rotation constant");
                            ^
sha3-reproducer.cxx:7:18: error: use of undeclared identifier 'ROT'
   static_assert(ROT > 0 && ROT < 8*sizeof(T), "Invalid rotation constant");
                 ^
sha3-reproducer.cxx:8:23: error: unknown type name 'T'
   return static_cast<T>((input << ROT) | (input >> (8*sizeof(T) - ROT)));
                      ^
sha3-reproducer.cxx:8:36: error: use of undeclared identifier 'ROT'
   return static_cast<T>((input << ROT) | (input >> (8*sizeof(T) - ROT)));
                                   ^
sha3-reproducer.cxx:8:68: error: use of undeclared identifier 'ROT'
   return static_cast<T>((input << ROT) | (input >> (8*sizeof(T) - ROT)));
                                                                   ^
2 warnings and 9 errors generated.
$ 

The problem obviously is (also?) CPU-related, because "downgrading" CPU arch to, e.g., core2 lets the reproducer pass:

$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -march=core2 -std=gnu++17 -o s -O3 sha3-reproducer.cxx 
$ ./s
$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -march=native -std=gnu++17 -o s -O3 sha3-reproducer.cxx 
$ ./s
Assertion failed: (T[0] == 16394434931424703552u), function main, file sha3-reproducer.cxx, line 103.
Abort trap: 6
$ 

DimitryAndric commented 2 years ago

I suspect it's the combination of the "right" compiler and CPU (see below).

Ah yes, I failed to mention that I was compiling this on a M1 Mac, so targeting arm64. As a data point, that works correctly in any case, both for the minimized test case, and when I do a full configure/make/make check.

The problem obviously is (also?) CPU-related, because "downgrading" CPU arch to, e.g., core2 lets the reproducer pass:


$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -march=core2 -std=gnu++17 -o s -O3 sha3-reproducer.cxx 
$ ./s
$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -march=native -std=gnu++17 -o s -O3 sha3-reproducer.cxx 
$ ./s
Assertion failed: (T[0] == 16394434931424703552u), function main, file sha3-reproducer.cxx, line 103.
Abort trap: 6

Right, so what is interesting is which CPU it detects natively, and therefore which CPU extensions it enables. Can you run these same clang commands with -v added? Then it should show all its intermediate options, one of which is -target-cpu xxx, identifying the detected CPU. I now suspect you are running into some sort of SSE- or AVX-specific bug.

mouse07410 commented 2 years ago

what is interesting is which CPU it detects natively, and therefore which CPU extensions it enables. Can you run these same clang commands with -v added? Then it should show all its intermediate options, one of which is -target-cpu xxx, identifying the detected CPU. I now suspect you are running into some sort of SSE- or AVX-specific bug.

Did I mention that the problem was bisected to the SLP Vectorizer in LLVM?

Regardless,

$ CXXFLAGS="" CFLAGS="" CPPFLAGS="" clang++ -v -march=native -std=gnu++17 -o s -O3 sha3-reproducer.cxx 
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
 "/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang" -cc1 -triple x86_64-apple-macosx11.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -emit-obj --mrelax-relocations -disable-free -disable-llvm-verifier -discard-value-names -main-file-name sha3-reproducer.cxx -mrelocation-model pic -pic-level 2 -mframe-pointer=all -fno-strict-return -fno-rounding-math -munwind-tables -target-sdk-version=11.3 -fvisibility-inlines-hidden-static-local-var -target-cpu skylake-avx512 -target-feature +sse2 -target-feature -tsxldtrk -target-feature +cx16 -target-feature +sahf -target-feature -tbm -target-feature -avx512ifma -target-feature -sha -target-feature -gfni -target-feature -fma4 -target-feature -vpclmulqdq -target-feature +prfchw -target-feature +bmi2 -target-feature -cldemote -target-feature +fsgsbase -target-feature -ptwrite -target-feature -amx-tile -target-feature -uintr -target-feature +popcnt -target-feature -widekl -target-feature +aes -target-feature -avx512bitalg -target-feature -movdiri -target-feature +xsaves -target-feature -avx512er -target-feature -avxvnni -target-feature -avx512vnni -target-feature -amx-bf16 -target-feature -avx512vpopcntdq -target-feature -pconfig -target-feature +clwb -target-feature +avx512f -target-feature +xsavec -target-feature -clzero -target-feature -pku -target-feature +mmx -target-feature -lwp -target-feature -rdpid -target-feature -xop -target-feature +rdseed -target-feature -waitpkg -target-feature -kl -target-feature -movdir64b -target-feature -sse4a -target-feature +avx512bw -target-feature +clflushopt -target-feature +xsave -target-feature -avx512vbmi2 -target-feature +64bit -target-feature +avx512vl -target-feature -serialize -target-feature -hreset -target-feature +invpcid -target-feature +avx512cd -target-feature +avx -target-feature -vaes -target-feature -avx512bf16 -target-feature +cx8 -target-feature 
+fma -target-feature +rtm -target-feature +bmi -target-feature -enqcmd -target-feature +rdrnd -target-feature -mwaitx -target-feature +sse4.1 -target-feature +sse4.2 -target-feature +avx2 -target-feature +fxsr -target-feature -wbnoinvd -target-feature +sse -target-feature +lzcnt -target-feature +pclmul -target-feature -prefetchwt1 -target-feature +f16c -target-feature +ssse3 -target-feature -sgx -target-feature -shstk -target-feature +cmov -target-feature -avx512vbmi -target-feature -amx-int8 -target-feature +movbe -target-feature -avx512vp2intersect -target-feature +xsaveopt -target-feature +avx512dq -target-feature +adx -target-feature -avx512pf -target-feature +sse3 -debugger-tuning=lldb -target-linker-version 711 -v -resource-dir /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/13.0.0 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -I/usr/local/include -stdlib=libc++ -internal-isystem /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/v1 -internal-isystem /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/13.0.0/include -internal-externc-isystem /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include -O3 -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -std=gnu++17 -fdeprecated-macro -fdebug-compilation-dir 
/Users/ur20980/src -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fno-cxx-modules -fcxx-exceptions -fexceptions -fmax-type-align=16 -fcommon -fcolor-diagnostics -vectorize-loops -vectorize-slp -clang-vendor-feature=+nullptrToBoolConversion -clang-vendor-feature=+messageToSelfInClassMethodIdReturnType -clang-vendor-feature=+disableInferNewAvailabilityFromInit -clang-vendor-feature=+disableNeonImmediateRangeCheck -clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+revert09abecef7bbf -mllvm -disable-aligned-alloc-awareness=1 -mllvm -enable-dse-memoryssa=0 -o /var/folders/c6/lnc_0m093ys8w16md_fm1mnxhtfnj8/T/sha3-reproducer-6eea93.o -x c++ sha3-reproducer.cxx
clang -cc1 version 13.0.0 (clang-1300.0.29.3) default target x86_64-apple-darwin20.6.0
ignoring nonexistent directory "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/local/include"
ignoring nonexistent directory "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/Library/Frameworks"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/v1
 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/13.0.0/include
 /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include
 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include
 /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks (framework directory)
End of search list.
 "/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld" -demangle -lto_library /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/libLTO.dylib -dynamic -arch x86_64 -platform_version macos 11.0.0 11.3 -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -o s -L/usr/local/lib /var/folders/c6/lnc_0m093ys8w16md_fm1mnxhtfnj8/T/sha3-reproducer-6eea93.o -lc++ -lSystem /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/13.0.0/lib/darwin/libclang_rt.osx.a
$ ./s
Assertion failed: (T[0] == 16394434931424703552u), function main, file sha3-reproducer.cxx, line 103.
Abort trap: 6
$ 

Of course, the above doesn't mean the problem is limited to -target-cpu skylake-avx512 (or -target-cpu penryn, or -target-cpu=sandybridge, or -target-cpu=opteron, or...).

DimitryAndric commented 2 years ago

Did I mention that the problem was bisected to the SLP Vectorizer in LLVM?

Yes, but I'm not aware of any particular LLVM commit that caused it; there isn't any further information in https://bugs.llvm.org/show_bug.cgi?id=51957, maybe you got that via other channels?

The main thing is whether this is an Apple specific change that broke something, or whether it is also reproducible in vanilla LLVM, if you specify e.g. -target-cpu skylake-avx512 like you showed.

mouse07410 commented 2 years ago

The problem disappears with -mno-sse4.1, and manifests otherwise.

On Linux, Clang-12 appears to not set -msse4.1 on by default, so the reproducer compiles and runs fine. However, if I explicitly specify -msse4.1, the reproducer fails assertions just like on the Mac:

$ clang++ -v -msse4.1 -std=gnu++17 -O3 -o s sha3-reproducer.cxx 
Ubuntu clang version 12.0.0-3ubuntu1~20.04.4
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/9
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/9
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/9
Candidate multilib: .;@m64
Selected multilib: .;@m64
 "/usr/lib/llvm-12/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -emit-obj --mrelax-relocations -disable-free -disable-llvm-verifier -discard-value-names -main-file-name sha3-reproducer.cxx -mrelocation-model static -mframe-pointer=none -fmath-errno -fno-rounding-math -mconstructor-aliases -munwind-tables -target-cpu x86-64 -target-feature +sse4.1 -tune-cpu generic -fno-split-dwarf-inlining -debugger-tuning=gdb -v -resource-dir /usr/lib/llvm-12/lib/clang/12.0.0 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/backward -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-12/lib/clang/12.0.0/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=gnu++17 -fdeprecated-macro -fdebug-compilation-dir /home/ur20980/src -ferror-limit 19 -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -o /tmp/sha3-reproducer-7717b4.o -x c++ sha3-reproducer.cxx
clang -cc1 version 12.0.0 based upon LLVM 12.0.0 default target x86_64-pc-linux-gnu
ignoring nonexistent directory "/include"
ignoring duplicate directory "/usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9"
#include "..." search starts here:
#include <...> search starts here:
 /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9
 /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/x86_64-linux-gnu/c++/9
 /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/backward
 /usr/local/include
 /usr/lib/llvm-12/lib/clang/12.0.0/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
 "/usr/bin/ld" -z relro --hash-style=gnu --build-id --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o s /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crt1.o /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crti.o /usr/bin/../lib/gcc/x86_64-linux-gnu/9/crtbegin.o -L/usr/bin/../lib/gcc/x86_64-linux-gnu/9 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu -L/usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/usr/lib/x86_64-linux-gnu/../../lib64 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../.. -L/usr/lib/llvm-12/bin/../lib -L/lib -L/usr/lib /tmp/sha3-reproducer-7717b4.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/bin/../lib/gcc/x86_64-linux-gnu/9/crtend.o /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crtn.o
$ ./s
s: sha3-reproducer.cxx:105: int main(): Assertion `T[0] == 16394434931424703552u' failed.
Aborted (core dumped)

DimitryAndric commented 2 years ago

The problem disappears with -mno-sse4.1, and manifests otherwise.

On Linux, Clang-12 appears to not set -msse4.1 on by default, so the reproducer compiles and runs fine.

Note that the original poster of this bug used -march=native. Most likely, this was run on a machine capable of SSE4.1, so that was why it got enabled by default. If you don't specify any particular -march= option, clang will target a 'generic' x86_64 CPU which will have SSE2 but not anything higher.

mouse07410 commented 2 years ago

Note that the original poster of this bug used -march=native.

Yes, I did. ;-) But the reproducer fails (with optimization) regardless of whether -march=native was specified or not.

Most likely, this was run on a machine capable of SSE4.1, so that was why it got enabled by default.

Yes, it was - but so was the machine that Linux runs on (although it is in a VM ;).

Regardless, it looks like we have now identified the exact Intel extension (SSE4.1) for which LLVM Clang-12 generates incorrect optimized code on x86_64.

vtjnash commented 2 years ago

I merged a commit upstream on Monday that may fix this: https://reviews.llvm.org/D106613

reneme commented 2 years ago

For what it's worth: The clang (trunk) build available on Compiler Explorer indeed produces the correct output. As of this writing, the clang (trunk) build was fcdefc8 committed earlier today and hence probably included your upstream change from Monday. Thanks a lot, everybody!

randombit commented 2 years ago

Fixed now in master and in 2.18.2. Thanks for reporting this, @mouse07410. @reneme + @hrantzsch, thanks for digging in and finding the problem. And also thank you to @vtjnash for fixing the Clang bug!