Bad codegen after 8adfa29706e (NaN constant folding weirdness)

dyung commented 1 year ago

We have an internal test that recently started to fail which I bisected back to 8adfa29706e5407b62a4726e2172894e0dfdc1e8.

I was able to reduce the test a little bit to the following:

extern "C" void printf(...);
typedef float __m128 __attribute__((__vector_size__(16)));
typedef int __v8su __attribute__((__vector_size__(32)));
typedef float __m256 __attribute__((__vector_size__(32)));
__m256 _mm256_max_ps___b, _mm256_hadd_ps___b, _mm256_hsub_ps___b,
    test89___trans_tmp_15, test89___trans_tmp_14, test89___trans_tmp_13,
    test89___trans_tmp_12, test89___trans_tmp_11, test89___trans_tmp_10,
    test89___trans_tmp_8, test89___trans_tmp_7, test89___trans_tmp_6,
    test89___trans_tmp_5, test89___trans_tmp_4, test89___trans_tmp_3,
    test89_id18854, test89_id18860, test89_id18872, test89_id18873;
template <typename T> T zero_upper(T in, unsigned) { return in; }
void init(char pred, void *data, unsigned size) {
  unsigned char *bytes = (unsigned char *)data;
  for (unsigned i = 0; i != size; ++i)
    bytes[i] = pred + i;
}
typedef long __attribute__((ext_vector_type(2))) ll2;
ll2 test89_id18839 = -1964383749;
__m128 test89_id18845, test89_id18879, test89_id18881;
void test89() {
  test89___trans_tmp_3 = __builtin_ia32_vcvtph2ps256(test89_id18839);
  init(69, &test89_id18845, sizeof(test89_id18845));
  test89___trans_tmp_4 = __builtin_shufflevector(test89_id18845, test89_id18845,
                                                 0, 1, 2, 3, 1, 1, 1, 1);
  __m256 id18844, id18870;
  test89___trans_tmp_5 = __builtin_ia32_rcpps256(id18844);
  test89___trans_tmp_6 =
      __builtin_ia32_maxps256(test89___trans_tmp_5, _mm256_max_ps___b);
  init(211, &test89_id18854, sizeof(test89_id18854));
  test89___trans_tmp_7 = (__v8su)test89___trans_tmp_6 & (__v8su)test89_id18854;
  init(205, &test89_id18860, sizeof(test89_id18860));
  test89___trans_tmp_8 =
      __builtin_ia32_hsubps256(test89_id18860, _mm256_hsub_ps___b);
  for (int id18871_idx = 0; id18871_idx < 92; ++id18871_idx) {
    init(220, &test89_id18872, sizeof(test89_id18872));
    id18870 *= test89_id18872;
  }
  init(220, &test89_id18873, sizeof(test89_id18873));
  __m128 id18878;
  init(252, &id18878, sizeof(id18878));
  for (int id18880_idx = 0; id18880_idx < 31; ++id18880_idx)
    test89_id18879 -= test89_id18881;
  __m128 __a = id18878;
  __a[0] += test89_id18881[0];
  test89___trans_tmp_10 =
      __builtin_shufflevector(__a, __a, 0, 1, 2, 3, 1, 1, 1, 1);
  __m256 id18874(zero_upper(test89___trans_tmp_10, 8));
  test89___trans_tmp_11 =
      __builtin_ia32_blendvps256(id18870, test89_id18873, id18874);
  test89___trans_tmp_12 = __builtin_shufflevector(
      test89___trans_tmp_8, test89___trans_tmp_11, 0, 8, 1, 1, 4, 2, 1, 1);
  test89___trans_tmp_13 =
      __builtin_ia32_haddps256(test89___trans_tmp_12, _mm256_hadd_ps___b);
  test89___trans_tmp_14 =
      ~(__v8su)test89___trans_tmp_7 & (__v8su)test89___trans_tmp_13;
  test89___trans_tmp_15 =
      (__v8su)test89___trans_tmp_3 | (__v8su)test89___trans_tmp_14;
  printf("%f\n", test89___trans_tmp_15[0]);
}
int main() { test89(); }

When the above code is compiled with optimizations targeting btver2 (-O2 -march=btver2), it generates a different value after 8adfa29706e5407b62a4726e2172894e0dfdc1e8:

$ ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/clang++ -O2 -march=btver2 test.cpp -o test.good.elf
$ ./test.good.elf
-268361104.000000
$ ~/src/upstream/8adfa29706e5407b62a4726e2172894e0dfdc1e8-linux/bin/clang++ -O2 -march=btver2 test.cpp -o test.bad.elf
$ ./test.bad.elf
-3701730859659994532950310912.000000

Here is a link to godbolt showing the output difference between trunk and LLVM 15: https://godbolt.org/z/E63hbbTfs

thesamesam commented 1 year ago

cc @LebedevRI

LebedevRI commented 1 year ago

Thank you, i will take a look. I have a suspicion it's going to end up pointing at X86 shuffle combines though (CC @RKSimon)

LebedevRI commented 1 year ago

@dyung running opt pipeline on that example even with clang-15 produces again different results: https://godbolt.org/z/Kd1WbfW5d Are you //sure// there is no usual FP brokenness going on in that example?

LebedevRI commented 1 year ago

So in that new additional SROA run, we promote %id18878.i = alloca <4 x float>, align 16. Said promotion looks obviously correct (C) to me. I think this is the important bit:

    rewriting [0,16) slice #0
      original:   %id18878.i.0.id18878.i.0.id18878.0..i = load <4 x float>, ptr %id18878.i, align 16, !tbaa !5
            to:   %8 = bitcast <16 x i8> %id18878.i.sroa.0.0.load to <4 x float>

aka

    %8 = bitcast <16 x i8> <i8 -4, i8 -3, i8 -2, i8 -1, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11> to <4 x float>

aka

IC: ConstFold to: <4 x float> <float 0xFFFFDFBF80000000, float 0x3860402000000000, float 0x38E0C0A080000000, float 0x3961412100000000> from:   %8 = bitcast <16 x i8> <i8 -4, i8 -3, i8 -2, i8 -1, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11> to <4 x float>

Relevant debug output:

`*** IR Dump Before SROAPass on main ***`

``` *** IR Dump Before SROAPass on main *** ; ModuleID = '/tmp/test.cpp' source_filename = "/tmp/test.cpp" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" @_mm256_max_ps___b = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @_mm256_hadd_ps___b = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @_mm256_hsub_ps___b = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_15 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_14 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_13 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_12 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_11 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_10 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_8 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_7 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_6 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_5 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_4 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_3 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18854 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18860 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18872 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18873 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18839 = dso_local local_unnamed_addr global <2 x i64> , align 16 @test89_id18845 = dso_local local_unnamed_addr global <4 x float> zeroinitializer, align 16 @test89_id18879 = dso_local local_unnamed_addr global <4 x float> zeroinitializer, align 16 @test89_id18881 = dso_local local_unnamed_addr global <4 x float> zeroinitializer, align 16 @.str = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1 ; Function Attrs: mustprogress norecurse uwtable define dso_local noundef i32 @main() local_unnamed_addr #0 { entry: %id18878.i = alloca <4 x float>, align 16 %0 = load <8 x half>, ptr @test89_id18839, align 16, !tbaa !5 %cvtph2ps.i = fpext <8 x half> %0 to <8 x float> store <8 x float> %cvtph2ps.i, ptr @test89___trans_tmp_3, align 32, !tbaa !5 store <16 x i8> , ptr @test89_id18845, align 16, !tbaa !5 store <8 x float> , ptr @test89___trans_tmp_4, align 32, !tbaa !5 %1 = tail call <8 x float> @llvm.x86.avx.rcp.ps.256(<8 x float> undef) store <8 x float> %1, ptr @test89___trans_tmp_5, align 32, !tbaa !5 %2 = load <8 x float>, ptr @_mm256_max_ps___b, align 32, !tbaa !5 %3 = tail call <8 x float> @llvm.x86.avx.max.ps.256(<8 x float> %1, <8 x float> %2) store <8 x float> %3, ptr @test89___trans_tmp_6, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18854, align 32, !tbaa !5 %4 = bitcast <8 x float> %3 to <8 x i32> %and.i = and <8 x i32> %4, store <8 x i32> %and.i, ptr @test89___trans_tmp_7, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18860, align 32, !tbaa !5 %5 = load <8 x float>, ptr @_mm256_hsub_ps___b, align 32, !tbaa !5 %6 = tail call <8 x float> @llvm.x86.avx.hsub.ps.256(<8 x float> , <8 x float> %5) store <8 x float> %6, ptr @test89___trans_tmp_8, align 32, !tbaa !5 br label %for.body.i52.preheader.i for.body.i52.preheader.i: ; preds = %entry store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18873, align 32, !tbaa !5 call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %id18878.i) store <16 x i8> , ptr %id18878.i, align 16, !tbaa !5 %7 = load <4 x float>, ptr @test89_id18881, align 16, !tbaa !5 %test89_id18879.promoted.i = load <4 x float>, ptr @test89_id18879, align 16, !tbaa !5 %sub.i = fsub <4 x float> %test89_id18879.promoted.i, %7 %sub.1.i = fsub <4 x float> %sub.i, %7 %sub.2.i = fsub <4 x float> %sub.1.i, %7 %sub.3.i = fsub <4 x float> %sub.2.i, %7 %sub.4.i = fsub <4 x float> %sub.3.i, %7 %sub.5.i = fsub <4 x float> %sub.4.i, %7 %sub.6.i = fsub <4 x float> %sub.5.i, %7 %sub.7.i = fsub <4 x float> %sub.6.i, %7 %sub.8.i = fsub <4 x float> %sub.7.i, %7 %sub.9.i = fsub <4 x float> %sub.8.i, %7 %sub.10.i = fsub <4 x float> %sub.9.i, %7 %sub.11.i = fsub <4 x float> %sub.10.i, %7 %sub.12.i = fsub <4 x float> %sub.11.i, %7 %sub.13.i = fsub <4 x float> %sub.12.i, %7 %sub.14.i = fsub <4 x float> %sub.13.i, %7 %sub.15.i = fsub <4 x float> %sub.14.i, %7 %sub.16.i = fsub <4 x float> %sub.15.i, %7 %sub.17.i = fsub <4 x float> %sub.16.i, %7 %sub.18.i = fsub <4 x float> %sub.17.i, %7 %sub.19.i = fsub <4 x float> %sub.18.i, %7 %sub.20.i = fsub <4 x float> %sub.19.i, %7 %sub.21.i = fsub <4 x float> %sub.20.i, %7 %sub.22.i = fsub <4 x float> %sub.21.i, %7 %sub.23.i = fsub <4 x float> %sub.22.i, %7 %sub.24.i = fsub <4 x float> %sub.23.i, %7 %sub.25.i = fsub <4 x float> %sub.24.i, %7 %sub.26.i = fsub <4 x float> %sub.25.i, %7 %sub.27.i = fsub <4 x float> %sub.26.i, %7 %sub.28.i = fsub <4 x float> %sub.27.i, %7 %sub.29.i = fsub <4 x float> %sub.28.i, %7 %sub.30.i = fsub <4 x float> %sub.29.i, %7 store <4 x float> %sub.30.i, ptr @test89_id18879, align 16, !tbaa !5 %id18878.i.0.id18878.i.0.id18878.0..i = load <4 x float>, ptr %id18878.i, align 16, !tbaa !5 %8 = fadd <4 x float> %7, %id18878.i.0.id18878.i.0.id18878.0..i %vecins.i = shufflevector <4 x float> %8, <4 x float> %id18878.i.0.id18878.i.0.id18878.0..i, <4 x i32> %shuffle9.i = shufflevector <4 x float> %vecins.i, <4 x float> poison, <8 x i32> store <8 x float> %shuffle9.i, ptr @test89___trans_tmp_10, align 32, !tbaa !5 %9 = load <8 x float>, ptr @test89_id18873, align 32, !tbaa !5 %10 = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> , <8 x float> %9, <8 x float> %shuffle9.i) store <8 x float> %10, ptr @test89___trans_tmp_11, align 32, !tbaa !5 %shuffle10.i = shufflevector <8 x float> %6, <8 x float> %10, <8 x i32> store <8 x float> %shuffle10.i, ptr @test89___trans_tmp_12, align 32, !tbaa !5 %11 = load <8 x float>, ptr @_mm256_hadd_ps___b, align 32, !tbaa !5 %12 = tail call <8 x float> @llvm.x86.avx.hadd.ps.256(<8 x float> %shuffle10.i, <8 x float> %11) store <8 x float> %12, ptr @test89___trans_tmp_13, align 32, !tbaa !5 %not.i = xor <8 x i32> %and.i, %13 = bitcast <8 x float> %12 to <8 x i32> %and11.i = and <8 x i32> %13, %not.i store <8 x i32> %and11.i, ptr @test89___trans_tmp_14, align 32, !tbaa !5 %14 = load <8 x i32>, ptr @test89___trans_tmp_3, align 32, !tbaa !5 %or.i = or <8 x i32> %14, %and11.i %15 = bitcast <8 x i32> %or.i to <8 x float> store <8 x i32> %or.i, ptr @test89___trans_tmp_15, align 32, !tbaa !5 %vecext12.i = extractelement <8 x float> %15, i64 0 %conv.i = fpext float %vecext12.i to double tail call void (...) @printf(ptr noundef nonnull @.str, double noundef %conv.i) call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %id18878.i) ret i32 0 } ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) declare void @llvm.lifetime.start.p0(i64 immarg, ptr nocapture) #1 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.rcp.ps.256(<8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.max.ps.256(<8 x float>, <8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.hsub.ps.256(<8 x float>, <8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) declare void @llvm.lifetime.end.p0(i64 immarg, ptr nocapture) #1 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>, <8 x float>, <8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.hadd.ps.256(<8 x float>, <8 x float>) #2 declare void @printf(...) local_unnamed_addr #3 attributes #0 = { mustprogress norecurse uwtable "frame-pointer"="none" "min-legal-vector-width"="256" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="btver2" "target-features"="+aes,+avx,+bmi,+crc32,+cx16,+cx8,+f16c,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+prfchw,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+sse4a,+ssse3,+x87,+xsave,+xsaveopt" } attributes #1 = { mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) } attributes #2 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) } attributes #3 = { "frame-pointer"="none" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="btver2" "target-features"="+aes,+avx,+bmi,+crc32,+cx16,+cx8,+f16c,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+prfchw,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+sse4a,+ssse3,+x87,+xsave,+xsaveopt" } !llvm.module.flags = !{!0, !1, !2, !3} !llvm.ident = !{!4} !0 = !{i32 1, !"wchar_size", i32 4} !1 = !{i32 8, !"PIC Level", i32 2} !2 = !{i32 7, !"PIE Level", i32 2} !3 = !{i32 7, !"uwtable", i32 2} !4 = !{!"clang version 16.0.0 (git@github.com:LebedevRI/llvm-project.git 01023bfcd33f922ed8c934ce563e54abe8bfe246)"} !5 = !{!6, !6, i64 0} !6 = !{!"omnipotent char", !7, i64 0} !7 = !{!"Simple C++ TBAA"} ```

`*** IR Dump Before InstCombinePass on main ***`

``` *** IR Dump Before InstCombinePass on main *** ; ModuleID = '/tmp/test.cpp' source_filename = "/tmp/test.cpp" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" @_mm256_max_ps___b = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @_mm256_hadd_ps___b = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @_mm256_hsub_ps___b = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_15 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_14 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_13 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_12 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_11 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_10 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_8 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_7 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_6 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_5 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_4 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89___trans_tmp_3 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18854 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18860 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18872 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18873 = dso_local local_unnamed_addr global <8 x float> zeroinitializer, align 32 @test89_id18839 = dso_local local_unnamed_addr global <2 x i64> , align 16 @test89_id18845 = dso_local local_unnamed_addr global <4 x float> zeroinitializer, align 16 @test89_id18879 = dso_local local_unnamed_addr global <4 x float> zeroinitializer, align 16 @test89_id18881 = dso_local local_unnamed_addr global <4 x float> zeroinitializer, align 16 @.str = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1 ; Function Attrs: mustprogress norecurse uwtable define dso_local noundef i32 @main() local_unnamed_addr #0 { entry: %0 = load <8 x half>, ptr @test89_id18839, align 16, !tbaa !5 %cvtph2ps.i = fpext <8 x half> %0 to <8 x float> store <8 x float> %cvtph2ps.i, ptr @test89___trans_tmp_3, align 32, !tbaa !5 store <16 x i8> , ptr @test89_id18845, align 16, !tbaa !5 store <8 x float> , ptr @test89___trans_tmp_4, align 32, !tbaa !5 %1 = tail call <8 x float> @llvm.x86.avx.rcp.ps.256(<8 x float> undef) store <8 x float> %1, ptr @test89___trans_tmp_5, align 32, !tbaa !5 %2 = load <8 x float>, ptr @_mm256_max_ps___b, align 32, !tbaa !5 %3 = tail call <8 x float> @llvm.x86.avx.max.ps.256(<8 x float> %1, <8 x float> %2) store <8 x float> %3, ptr @test89___trans_tmp_6, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18854, align 32, !tbaa !5 %4 = bitcast <8 x float> %3 to <8 x i32> %and.i = and <8 x i32> %4, store <8 x i32> %and.i, ptr @test89___trans_tmp_7, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18860, align 32, !tbaa !5 %5 = load <8 x float>, ptr @_mm256_hsub_ps___b, align 32, !tbaa !5 %6 = tail call <8 x float> @llvm.x86.avx.hsub.ps.256(<8 x float> , <8 x float> %5) store <8 x float> %6, ptr @test89___trans_tmp_8, align 32, !tbaa !5 br label %for.body.i52.preheader.i for.body.i52.preheader.i: ; preds = %entry store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18872, align 32, !tbaa !5 store <32 x i8> , ptr @test89_id18873, align 32, !tbaa !5 %7 = load <4 x float>, ptr @test89_id18881, align 16, !tbaa !5 %test89_id18879.promoted.i = load <4 x float>, ptr @test89_id18879, align 16, !tbaa !5 %sub.i = fsub <4 x float> %test89_id18879.promoted.i, %7 %sub.1.i = fsub <4 x float> %sub.i, %7 %sub.2.i = fsub <4 x float> %sub.1.i, %7 %sub.3.i = fsub <4 x float> %sub.2.i, %7 %sub.4.i = fsub <4 x float> %sub.3.i, %7 %sub.5.i = fsub <4 x float> %sub.4.i, %7 %sub.6.i = fsub <4 x float> %sub.5.i, %7 %sub.7.i = fsub <4 x float> %sub.6.i, %7 %sub.8.i = fsub <4 x float> %sub.7.i, %7 %sub.9.i = fsub <4 x float> %sub.8.i, %7 %sub.10.i = fsub <4 x float> %sub.9.i, %7 %sub.11.i = fsub <4 x float> %sub.10.i, %7 %sub.12.i = fsub <4 x float> %sub.11.i, %7 %sub.13.i = fsub <4 x float> %sub.12.i, %7 %sub.14.i = fsub <4 x float> %sub.13.i, %7 %sub.15.i = fsub <4 x float> %sub.14.i, %7 %sub.16.i = fsub <4 x float> %sub.15.i, %7 %sub.17.i = fsub <4 x float> %sub.16.i, %7 %sub.18.i = fsub <4 x float> %sub.17.i, %7 %sub.19.i = fsub <4 x float> %sub.18.i, %7 %sub.20.i = fsub <4 x float> %sub.19.i, %7 %sub.21.i = fsub <4 x float> %sub.20.i, %7 %sub.22.i = fsub <4 x float> %sub.21.i, %7 %sub.23.i = fsub <4 x float> %sub.22.i, %7 %sub.24.i = fsub <4 x float> %sub.23.i, %7 %sub.25.i = fsub <4 x float> %sub.24.i, %7 %sub.26.i = fsub <4 x float> %sub.25.i, %7 %sub.27.i = fsub <4 x float> %sub.26.i, %7 %sub.28.i = fsub <4 x float> %sub.27.i, %7 %sub.29.i = fsub <4 x float> %sub.28.i, %7 %sub.30.i = fsub <4 x float> %sub.29.i, %7 store <4 x float> %sub.30.i, ptr @test89_id18879, align 16, !tbaa !5 %8 = bitcast <16 x i8> to <4 x float> %9 = fadd <4 x float> %7, %8 %vecins.i = shufflevector <4 x float> %9, <4 x float> %8, <4 x i32> %shuffle9.i = shufflevector <4 x float> %vecins.i, <4 x float> poison, <8 x i32> store <8 x float> %shuffle9.i, ptr @test89___trans_tmp_10, align 32, !tbaa !5 %10 = load <8 x float>, ptr @test89_id18873, align 32, !tbaa !5 %11 = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> , <8 x float> %10, <8 x float> %shuffle9.i) store <8 x float> %11, ptr @test89___trans_tmp_11, align 32, !tbaa !5 %shuffle10.i = shufflevector <8 x float> %6, <8 x float> %11, <8 x i32> store <8 x float> %shuffle10.i, ptr @test89___trans_tmp_12, align 32, !tbaa !5 %12 = load <8 x float>, ptr @_mm256_hadd_ps___b, align 32, !tbaa !5 %13 = tail call <8 x float> @llvm.x86.avx.hadd.ps.256(<8 x float> %shuffle10.i, <8 x float> %12) store <8 x float> %13, ptr @test89___trans_tmp_13, align 32, !tbaa !5 %not.i = xor <8 x i32> %and.i, %14 = bitcast <8 x float> %13 to <8 x i32> %and11.i = and <8 x i32> %14, %not.i store <8 x i32> %and11.i, ptr @test89___trans_tmp_14, align 32, !tbaa !5 %15 = load <8 x i32>, ptr @test89___trans_tmp_3, align 32, !tbaa !5 %or.i = or <8 x i32> %15, %and11.i %16 = bitcast <8 x i32> %or.i to <8 x float> store <8 x i32> %or.i, ptr @test89___trans_tmp_15, align 32, !tbaa !5 %vecext12.i = extractelement <8 x float> %16, i64 0 %conv.i = fpext float %vecext12.i to double tail call void (...) @printf(ptr noundef nonnull @.str, double noundef %conv.i) ret i32 0 } ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) declare void @llvm.lifetime.start.p0(i64 immarg, ptr nocapture) #1 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.rcp.ps.256(<8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.max.ps.256(<8 x float>, <8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.hsub.ps.256(<8 x float>, <8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) declare void @llvm.lifetime.end.p0(i64 immarg, ptr nocapture) #1 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>, <8 x float>, <8 x float>) #2 ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none) declare <8 x float> @llvm.x86.avx.hadd.ps.256(<8 x float>, <8 x float>) #2 declare void @printf(...) local_unnamed_addr #3 attributes #0 = { mustprogress norecurse uwtable "frame-pointer"="none" "min-legal-vector-width"="256" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="btver2" "target-features"="+aes,+avx,+bmi,+crc32,+cx16,+cx8,+f16c,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+prfchw,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+sse4a,+ssse3,+x87,+xsave,+xsaveopt" } attributes #1 = { mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) } attributes #2 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) } attributes #3 = { "frame-pointer"="none" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="btver2" "target-features"="+aes,+avx,+bmi,+crc32,+cx16,+cx8,+f16c,+fxsr,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+prfchw,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+sse4a,+ssse3,+x87,+xsave,+xsaveopt" } !llvm.module.flags = !{!0, !1, !2, !3} !llvm.ident = !{!4} !0 = !{i32 1, !"wchar_size", i32 4} !1 = !{i32 8, !"PIC Level", i32 2} !2 = !{i32 7, !"PIE Level", i32 2} !3 = !{i32 7, !"uwtable", i32 2} !4 = !{!"clang version 16.0.0 (git@github.com:LebedevRI/llvm-project.git 01023bfcd33f922ed8c934ce563e54abe8bfe246)"} !5 = !{!6, !6, i64 0} !6 = !{!"omnipotent char", !7, i64 0} !7 = !{!"Simple C++ TBAA"} ```

`SROA function: main`

``` SROA function: main SROA alloca: %id18878.i = alloca <4 x float>, align 16 Rewriting FCA loads and stores... Slices of alloca: %id18878.i = alloca <4 x float>, align 16 [0,16) slice #0 used by: %id18878.i.0.id18878.i.0.id18878.0..i = load <4 x float>, ptr %id18878.i, align 16, !tbaa !5 [0,16) slice #1 used by: store <16 x i8> , ptr %id18878.i, align 16, !tbaa !5 [0,16) slice #2 (splittable) used by: call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %id18878.i) [0,16) slice #3 (splittable) used by: call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %id18878.i) Pre-splitting loads and stores Searching for candidate loads and stores Rewriting alloca partition [0,16) to: %id18878.i.sroa.0 = alloca <16 x i8>, align 16 rewriting [0,16) slice #0 original: %id18878.i.0.id18878.i.0.id18878.0..i = load <4 x float>, ptr %id18878.i, align 16, !tbaa !5 to: %8 = bitcast <16 x i8> %id18878.i.sroa.0.0.load to <4 x float> rewriting [0,16) slice #1 original: store <16 x i8> , ptr %id18878.i, align 16, !tbaa !5 to: store <16 x i8> , ptr %id18878.i.sroa.0, align 16, !tbaa !5 rewriting [0,16) slice #2 (splittable) original: call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %id18878.i) to: call void @llvm.lifetime.start.p0(i64 16, ptr %id18878.i.sroa.0) rewriting [0,16) slice #3 (splittable) original: call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %id18878.i) to: call void @llvm.lifetime.end.p0(i64 16, ptr %id18878.i.sroa.0) Speculating PHIs Speculating Selects Deleting dead instruction: call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %id18878.i) Deleting dead instruction: call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %id18878.i) Deleting dead instruction: store <16 x i8> , ptr %id18878.i, align 16, !tbaa !5 Deleting dead instruction: %id18878.i.0.id18878.i.0.id18878.0..i = load <4 x float>, ptr %id18878.i, align 16, !tbaa !5 Deleting dead instruction: %id18878.i = alloca <4 x float>, align 16 Promoting allocas with mem2reg... *** IR Dump Before InstCombinePass on main ***

dyung commented 1 year ago

@dyung running opt pipeline on that example even with clang-15 produces again different results: https://godbolt.org/z/Kd1WbfW5d Are you //sure// there is no usual FP brokenness going on in that example?

I’m not really familiar enough with FP stuff to say unfortunately. Although I did note that before your change, the O0 and O2 compilations did produce the same result, while after only the O0 compilation produces the expected result.

LebedevRI commented 1 year ago

Usually that's a tell-tale of UB in source code.

LebedevRI commented 1 year ago

Another observation:

$ clang++-15 -march=btver2 -O3 /tmp/test.cpp -o old.ll -S -emit-llvm 
/tmp/test.cpp:29:8: warning: implicit conversion from 'int' to 'char' changes value from 211 to -45 [-Wconstant-conversion]
  init(211, &test89_id18854, sizeof(test89_id18854));
  ~~~~ ^~~
/tmp/test.cpp:31:8: warning: implicit conversion from 'int' to 'char' changes value from 205 to -51 [-Wconstant-conversion]
  init(205, &test89_id18860, sizeof(test89_id18860));
  ~~~~ ^~~
/tmp/test.cpp:35:10: warning: implicit conversion from 'int' to 'char' changes value from 220 to -36 [-Wconstant-conversion]
    init(220, &test89_id18872, sizeof(test89_id18872));
    ~~~~ ^~~
/tmp/test.cpp:38:8: warning: implicit conversion from 'int' to 'char' changes value from 220 to -36 [-Wconstant-conversion]
  init(220, &test89_id18873, sizeof(test89_id18873));
  ~~~~ ^~~
/tmp/test.cpp:40:8: warning: implicit conversion from 'int' to 'char' changes value from 252 to -4 [-Wconstant-conversion]
  init(252, &id18878, sizeof(id18878));
  ~~~~ ^~~
5 warnings generated.
$ lli-16 old.ll 
-268361104.000000
$ ./bin/opt -sroa old.ll -o - | lli-16 -
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-268361104.000000
$ ./bin/opt -sroa -instcombine old.ll -o - | lli-16 -
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-4939670898945374807849435136.000000
$ ./bin/opt -instcombine old.ll -o - | lli-16 -
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-268361104.000000

So the SROA itself does not appear to cause the problem, but InstCombine manages to capitalize on the promotion, and expose it.

dyung commented 1 year ago

It's come to my attention that the reduction may have introduced some uses of undefined variables. Let me try to get the original test with the dependencies as reduced as I can.

LebedevRI commented 1 year ago

It's come to my attention that the reduction may have introduced some uses of undefined variables. Let me try to get the original test with the dependencies as reduced as I can.

Thanks! Just so we eliminate the obvious, given the most original test case you have, what happens if you do:

$ good-clang++ -march=btver2 -O3 unreduced-test.cpp -S -emit-llvm -o - | good-opt -sroa -instcombine - -o -| good-lli -

? If it does not print -268361104.000000, then the problem is elsewhere.

dyung commented 1 year ago

It's come to my attention that the reduction may have introduced some uses of undefined variables. Let me try to get the original test with the dependencies as reduced as I can.

Thanks! Just so we eliminate the obvious, given the most original test case you have, what happens if you do:
$ good-clang++ -march=btver2 -O3 unreduced-test.cpp -S -emit-llvm -o - | good-opt -sroa -instcombine - -o -| good-lli -
? If it does not print -268361104.000000, then the problem is elsewhere.

The original-ish test case that I am working on reducing does indeed print that:

$ ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/clang++ -march=btver2 -O3 test.cpp -S -emit-llvm -o - | ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/opt -sroa -instcombine - -o - | ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/lli -
test.cpp:68:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-268361104.000000

dyung commented 1 year ago

Try this example which more or less leaves the original test function intact (I only removed the printing of some of the other bytes of id18835 which did not seem to be affected):

extern "C" {
  int sprintf(char *str, const char *format, ...);
  int printf(char *, ...);
  int isnan(float);
  int isinf(float);
};

#include <x86intrin.h>

__attribute__((optnone))
float norm_nan(float x) {
    if (isnan(x) || isinf(x))
        return 0.0f;
    return x;
}

template <typename T>
static T zero_upper(T in, unsigned bits)
{
  constexpr unsigned elems = sizeof(T) / sizeof(char);
  union { T x; char c[elems]; };
  x = in;
  unsigned elems_to_zero = bits / 8;
  for (unsigned i = elems_to_zero; i != elems; ++i)
    c[i] = 0;
  return x;
}

static void init(unsigned char pred, volatile void *data, unsigned size) {
  unsigned char *bytes = (unsigned char *)data;
  for (unsigned i = 0; i != size; ++i) {
    bytes[i] = pred + i;
  }
}
#define INIT(PRED, VAR) init(PRED, &VAR, sizeof(VAR))

__attribute__((noinline))
static void print(const char *msg, volatile void *data, unsigned size) {
  unsigned char *bytes = (unsigned char *)data;
  char tmp[256];
  for (unsigned i = 0; i != size; ++i) {
    sprintf(tmp + i * 2, "%02x", bytes[size - 1 - i]);
  }

  printf("%s:%s\n", msg, tmp);
}
#define PRINT(VAR) print(#VAR, &VAR, sizeof(VAR))

typedef unsigned char uchar;
typedef long long ll;
typedef ll __attribute__((ext_vector_type(2))) ll2;
typedef float __attribute__((ext_vector_type(8))) float8;

#define cast_ll2_to___m128i(__A) ((__m128i)(__A))

#define SAFE__mm256_castps128_ps256(__A) (zero_upper(_mm256_castps128_ps256(__A), 128))
#define static_cast___m256_to_float8(__A) (static_cast<float8>(__A))

void test89()
{
          ll2 id18839 = (ll2){(ll)-1964383749LL, (ll)994513392LL}; // vec_type
        __m128i id18838 = cast_ll2_to___m128i(id18839);
      __m256 id18837 = _mm256_cvtph_ps(id18838);
                __m128 id18845;
                INIT(69, id18845);
              __m256 id18844 = SAFE__mm256_castps128_ps256(id18845);
            __m256 id18843 = _mm256_rcp_ps(id18844);
                  uchar id18851;
                  INIT(113, id18851);
                volatile __m256 id18848;
                INIT(189, id18848);
                for (uchar id18849_idx = 0; (id18849_idx < id18851); ++id18849_idx)
                {
                  __m256 id18850;
                  INIT(223, id18850);
                  id18848 += id18850;
                }
                  long long id18853;
                  INIT(227, id18853);
                __m256i id18852 = _mm256_set1_epi64x(id18853);
              __m256 id18847 = _mm256_permutevar_ps(id18848, id18852);
            __m256 id18846 = _mm256_movehdup_ps(id18847);
          __m256 id18842 = _mm256_max_ps(id18843, id18846);
          volatile __m256 id18854;
          INIT(211, id18854);
        volatile __m256 id18841 = _mm256_and_ps(id18842, id18854);
              __m256 id18858;
              INIT(120, id18858);
              for (uchar id18859_idx = 0; (id18859_idx < 4); ++id18859_idx)
              {
                static volatile __m256 id18860;
                INIT(205, id18860);
                id18858 = id18860;
              }
              volatile __m256 id18861;
              INIT(18, id18861);
              for (uchar id18862_idx = 0; (id18862_idx < 137); ++id18862_idx)
              {
                    uchar id18867 = id18862_idx;
                  __m256 id18864;
                  INIT(214, id18864);
                  for (uchar id18865_idx = 0; (id18865_idx < id18867); ++id18865_idx)
                  {
                    volatile __m256 id18866;
                    INIT(239, id18866);
                    id18864 -= id18866;
                  }
                  __m256 id18868;
                  INIT(78, id18868);
                __m256 id18863 = _mm256_addsub_ps(id18864, id18868);
                id18861 += id18863;
              }
            __m256 id18857 = _mm256_hsub_ps(id18858, id18861);
              __m256 id18870;
              INIT(83, id18870);
              for (uchar id18871_idx = 0; (id18871_idx < 92); ++id18871_idx)
              {
                __m256 id18872;
                INIT(220, id18872);
                id18870 *= id18872;
              }
              volatile __m256 id18873;
              INIT(220, id18873);
                __m128 id18875;
                INIT(11, id18875);
                for (uchar id18876_idx = 0; (id18876_idx < 214); ++id18876_idx)
                {
                    __m128 id18878;
                    INIT(252, id18878);
                    __m128 id18879;
                    INIT(59, id18879);
                    for (uchar id18880_idx = 0; (id18880_idx < 131); ++id18880_idx)
                    {
                      __m128 id18881;
                      INIT(78, id18881);
                      id18879 -= id18881;
                    }
                  __m128 id18877 = _mm_add_ss(id18878, id18879);
                  id18875 *= id18877;
                }
              __m256 id18874 = SAFE__mm256_castps128_ps256(id18875);
            __m256 id18869 = _mm256_blendv_ps(id18870, id18873, id18874);
          __m256 id18856 = _mm256_unpacklo_ps(id18857, id18869);
          __m256 id18882;
          INIT(122, id18882);
          for (uchar id18883_idx = 0; (id18883_idx < 252); ++id18883_idx)
          {
            __m256 id18884;
            INIT(21, id18884);
            id18882 += id18884;
          }
        __m256 id18855 = _mm256_hadd_ps(id18856, id18882);
      __m256 id18840 = _mm256_andnot_ps(id18841, id18855);
    __m256 id18836 = _mm256_or_ps(id18837, id18840);
  float8 id18835 = static_cast___m256_to_float8(id18836);
  id18835[0] = norm_nan(id18835[0]);
  printf("%f\n", id18835[0]);
}

int main() {
  test89();
}

This was the source that I used with the command you asked me to try. There shouldn't be any uses of undefined variables here hopefully!

LebedevRI commented 1 year ago

FWIW, that didn't change the outcome, running -sroa -instcombine on the good IR still resurfaces the problem:

$ clang++-15 -march=btver2 -O2 /tmp/test.cpp -o old.ll -S -emit-llvm 
/tmp/test.cpp:45:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
/tmp/test.cpp:158:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
$ ./bin/opt -sroa -instcombine old.ll -o - | lli-16 -
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-3701730859659994532950310912.000000
$ ./bin/opt -sroa old.ll -o - | lli-16 -
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-268361104.000000
$ ./bin/opt -instcombine old.ll -o - | lli-16 -
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-268361104.000000

LebedevRI commented 1 year ago

The original-ish test case that I am working on reducing does indeed print that:

$ ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/clang++ -march=btver2 -O3 test.cpp -S -emit-llvm -o - | ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/opt -sroa -instcombine - -o - | ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/lli -
test.cpp:68:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-268361104.000000

Actually, no, it doesn't:

$ clang++-15 -march=btver2 -O2 /tmp/test.cpp -o - -emit-llvm -S | ./bin/opt -O2 -o - - | ./bin/lli
/tmp/test.cpp:45:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
/tmp/test.cpp:158:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
-3701730859659994532950310912.000000

LebedevRI commented 1 year ago

I strongly suspect UB in your original source code.

dyung commented 1 year ago

The original-ish test case that I am working on reducing does indeed print that:

$ ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/clang++ -march=btver2 -O3 test.cpp -S -emit-llvm -o - | ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/opt -sroa -instcombine - -o - | ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/lli -
test.cpp:68:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
The `opt -passname` syntax for the new pass manager is deprecated, please use `opt -passes=<pipeline>` (or the `-p` alias for a more concise version).
See https://llvm.org/docs/NewPassManager.html#invoking-opt for more details on the pass pipeline syntax.

-268361104.000000

Actually, no, it doesn't:

$ clang++-15 -march=btver2 -O2 /tmp/test.cpp -o - -emit-llvm -S | ./bin/opt -O2 -o - - | ./bin/lli
/tmp/test.cpp:45:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
/tmp/test.cpp:158:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
-3701730859659994532950310912.000000

Interestingly, there seems to be a difference between O2 and O3:

$ ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/clang++ -march=btver2 -O2 test.cpp -o - -emit-llvm -S | ~/src/upstream/8adfa29706e5407b62a4726e2172894e0dfdc1e8-linux/bin/opt -O2 -o - - | ~/src/upstream/8adfa29706e5407b62a4726e2172894e0dfdc1e8-linux/bin/lli -
test.cpp:45:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
test.cpp:158:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
-3701730859659994532950310912.000000
$ ~/src/upstream/ffe1661fabc9cf379a10a0bf15268c6549e4836f-linux/bin/clang++ -march=btver2 -O3 test.cpp -o - -emit-llvm -S | ~/src/upstream/8adfa29706e5407b62a4726e2172894e0dfdc1e8-linux/bin/opt -O2 -o - - | ~/src/upstream/8adfa29706e5407b62a4726e2172894e0dfdc1e8-linux/bin/lli -
test.cpp:45:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%s:%s\n", msg, tmp);
         ^
test.cpp:158:10: warning: ISO C++11 does not allow conversion from string literal to 'char *' [-Wwritable-strings]
  printf("%f\n", id18835[0]);
         ^
2 warnings generated.
-268361104.000000

LebedevRI commented 1 year ago

(Which, again, is usually a tell-tale of UB.)

dyung commented 1 year ago

I think you are probably right at this point. I'm not very familiar with fp stuff, any suggestions on how to find the UB stuff? Is there a sanitizer for this I could try? I'll also try to ask around internally to see if anyone has any ideas.

EugeneZelenko commented 1 year ago

@dyung: Yes, there is undefined behavior sanitizer. Please also enable as much compiler warnings as possible and run static analysis tool like Clang Static Analyzer and Clang-tidy.

LebedevRI commented 1 year ago

Unfortunately, i've already tried UBSan/ASan/MSan on these examples, and they didn't complain. If you do figure it out, please let me know, perhaps UBSan can be taught about it.

gregbedwell commented 1 year ago

Here's a further reduced example. I think this is nan weirdness - possibly caused by the way id18878 is initialized. At O0 I observe:

id18875:0.000000 0.000000 0.000000 0.000000
id18879:0.184800 48.312740 12625.065430 3297809.750000
id18877:-nan 0.000000 0.000000 0.000000
id18875:-nan 0.000000 0.000000 0.000000

and at O2:

id18875:0.000000 0.000000 0.000000 0.000000
id18879:0.184800 48.312740 12625.065430 3297809.750000
id18877:nan 0.000000 0.000000 0.000000
id18875:nan 0.000000 0.000000 0.000000

extern "C" {
  int sprintf(const char *str, const char *format, ...);
  int printf(const char *, ...);
};

#include <x86intrin.h>

static void init(unsigned char pred, volatile void *data, unsigned size) {
  unsigned char *bytes = (unsigned char *)data;
  for (unsigned i = 0; i != size; ++i) {
    bytes[i] = pred + i;
  }
}
#define INIT(PRED, VAR) init(PRED, &VAR, sizeof(VAR))

void test89()
{
                __m128 id18875;
                INIT(11, id18875);
                printf("id18875:%f %f %f %f\n", id18875[0], id18875[1], id18875[2], id18875[3]);
                    __m128 id18878;
                    INIT(252, id18878);
                    __m128 id18879;
                    INIT(59, id18879);
                    id18879 = (__m128){0.1848f, 48.31274f, 12625.06543f, 3297809.75f};
                    printf("id18879:%f %f %f %f\n", id18879[0], id18879[1], id18879[2], id18879[3]);
                    for (unsigned char id18880_idx = 0; (id18880_idx < 131); ++id18880_idx)
                    {
                      __m128 id18881 = (__m128){55917731840.f, 14590895194112.f, 3805913865519104.f, 992398991704457215.f};
                      id18879 -= id18881;
                    }
                  __m128 id18877 = _mm_add_ss(id18878, id18879);
                  id18875 *= id18877;
                  printf("id18877:%f %f %f %f\n", id18877[0], id18877[1], id18877[2], id18877[3]);
                printf("id18875:%f %f %f %f\n", id18875[0], id18875[1], id18875[2], id18875[3]);
}

int main() {
  test89();
}

wjristow commented 1 year ago

I've looked into this, and the short summary is that I'm virtually certain this is a latent compiler bug in InstCombine exposed by the new run of SROA (rather than undefined behavior, or a more blatant bug in the test-case). As suggested by @EugeneZelenko, it looks related to shuffle vector (possibly some incorrect poisoning of operands), and as noted by @gregbedwell, it's related to the handling of NaNs (so something of a rare corner case, IMO).

Using a modified version of Greg's reduced test-case (listed below, containing some changes that include a more detailed printing of vector float values, via a function printVect), I get the following behavior at '-O0'

        id18875: 1.739e-30(0x0e0d0c0b) 4.577e-28(0x1211100f) 1.204e-25(0x16151413) 3.166e-23(0x1a191817)
        id18879: 1.848e-01(0x3e3d3c36) 4.831e+01(0x4241403f) 1.263e+04(0x46454443) 3.298e+06(0x4a494847)
  Input id18877: -nan(0xfffefdfc) 3.820e-37(0x03020100) 1.008e-34(0x07060504) 2.658e-32(0x0b0a0908)
  Input id18879: -7.325e+12(0xd4d53138) -1.911e+15(0xd8d94d56) -4.986e+17(0xdcdd696f) -1.300e+20(0xe0e18589)
id18877[0] += id18879[0];
Updated id18877: -nan(0xfffefdfc) 3.820e-37(0x03020100) 1.008e-34(0x07060504) 2.658e-32(0x0b0a0908)
id18875 *= id18877;
Updated id18875: -nan(0xfffefdfc) 0.000e+00(0x00000000) 0.000e+00(0x00000000) 0.000e+00(0x00000000)

and at '-O2':

        id18875: 1.739e-30(0x0e0d0c0b) 4.577e-28(0x1211100f) 1.204e-25(0x16151413) 3.166e-23(0x1a191817)
        id18879: 1.848e-01(0x3e3d3c36) 4.831e+01(0x4241403f) 1.263e+04(0x46454443) 3.298e+06(0x4a494847)
  Input id18877: -nan(0xfffefdfc) 3.820e-37(0x03020100) 1.008e-34(0x07060504) 2.658e-32(0x0b0a0908)
  Input id18879: -7.325e+12(0xd4d53138) -1.911e+15(0xd8d94d56) -4.986e+17(0xdcdd696f) -1.300e+20(0xe0e18589)
id18877[0] += id18879[0];
Updated id18877: nan(0x7fc00000) 3.820e-37(0x03020100) 1.008e-34(0x07060504) 2.658e-32(0x0b0a0908)
id18875 *= id18877;
Updated id18875: nan(0x7fc00000) 0.000e+00(0x00000000) 0.000e+00(0x00000000) 0.000e+00(0x00000000)

The two "Updated" lines produce the wrong NaN value. (FTR, the compiler previous to 8adfa29706e5407b62a4726e2172894e0dfdc1e8 gets the '-O0' results shown above at all optimization levels.) The "Updated id18877" value that's printed above in the '-O2' version isn't actually computed at run-time -- it's just directly a literal in the IR (a literal that appears after a transformation related to a shufflevector in InstCombine).

typedef unsigned long size_t;
typedef float __m128 __attribute__((__vector_size__(16), __aligned__(16)));
extern "C" int printf(const char *, ...);
extern "C" void *memcpy(void *dest, const void *src, size_t n);

static __attribute__((noinline)) void rawFloatPrint(float f) {
  unsigned int u;
  memcpy(&u, &f, sizeof(f));
  printf("%.3e(0x%08x)", f, u);
}
static __attribute__((noinline)) void printVect(const char *name, __m128 val) {
  if ((name != nullptr) && (*name != '\0'))
    printf("%15s: ", name);
  for (int i = 0; i < 4; ++i) {
    if (i) printf(" ");
    rawFloatPrint(val[i]);
  }
  printf("\n");
}
static void init(unsigned char pred, volatile void *data, unsigned size) {
  unsigned char *bytes = (unsigned char *)data;
  for (unsigned i = 0; i != size; ++i) {
    bytes[i] = pred + i;
  }
}
#define INIT(PRED, VAR) init(PRED, &VAR, sizeof(VAR))
void test89() {
  __m128 id18875;
  INIT(11, id18875);
  printVect("id18875", id18875);
  __m128 id18878;
  INIT(252, id18878); // 'id18878[0]' Becomes 0xfffefdfc (a NaN).
  __m128 id18879;
  INIT(59, id18879);
  id18879 = (__m128){0.1848f, 48.31274f, 12625.06543f, 3297809.75f};
  printVect("id18879", id18879);
  __m128 id18881 = (__m128){55917731840.f, 14590895194112.f,
                            3805913865519104.f, 992398991704457215.f};

  for (unsigned char id18880_idx = 0; id18880_idx < 131; ++id18880_idx)
    id18879 -= id18881;

  __m128 id18877 = id18878; // 'id18878' was set to 0xfffefdfc (NaN) above.
  printVect("Input id18877", id18877); // id18877[0] is 0xfffefdfc (NaN)
  printVect("Input id18879", id18879); // id18879[0] is -7.325e+12
  printf("id18877[0] += id18879[0];\n");
  id18877[0] += id18879[0];            // id18877[0] = 0xfffefdfc + (-7.325e+12)
  printVect("Updated id18877", id18877); // id18877[0] _should_ be the same NaN
  printf("id18875 *= id18877;\n");
  id18875 *= id18877;
  printVect("Updated id18875", id18875);
}
int main() { test89(); }

LebedevRI commented 1 year ago

cc @spatel-gh

LebedevRI commented 1 year ago

@nunoplopes FYI, this might be in your current area of interest.

rotateright commented 1 year ago

A specific NaN value is replaced with a canonical NaN value. We're not concerned about anything else in this report, are we?

I think this is the minimized IR that shows the problem:

define <2 x double> @f(<2 x double> %x) {
  %a = fadd <2 x double> %x, <double 0xFFFFDFBF80000000, double poison>
  %r = shufflevector <2 x double> %a, <2 x double> <double 0xFFFFDFBF80000000, double 0xdead000000000000>, <2 x i32> <i32 0, i32 3>
  ret <2 x double> %r
}

% opt -p instcombine addnan.ll -S -debug Args: opt -p instcombine addnan.ll -S -debug

INSTCOMBINE ITERATION #1 on f ADD: ret <2 x double> %r ADD: %r = shufflevector <2 x double> %a, <2 x double> <double 0xFFFFDFBF80000000, double 0xDEAD000000000000>, <2 x i32> <i32 0, i32 3> ADD: %a = fadd <2 x double> %x, <double 0xFFFFDFBF80000000, double poison> IC: Visiting: %a = fadd <2 x double> %x, <double 0xFFFFDFBF80000000, double poison> IC: Replacing %a = fadd <2 x double> %x, <double 0xFFFFDFBF80000000, double poison> with <2 x double> <double 0x7FF8000000000000, double 0x7FF8000000000000>

define <2 x double> @f(<2 x double> %x) { ret <2 x double> <double 0x7FF8000000000000, double 0xDEAD000000000000> }

rotateright commented 1 year ago

Backtrack on the path that led to that behavior (in InstSimplify): https://reviews.llvm.org/D44521 https://lists.llvm.org/pipermail/llvm-dev/2018-March/121481.html

That seems to be before we introduced "poison" in IR. Now that we have it, we can probably sidestep the issue in this example at least.

But also note that there is no requirement that we preserve a NaN payload in IEEE-754, so technically, there is no bug here. Ie, we give our best effort to not change NaN bits, but users/Alive2 can't rely on that behavior.

rotateright commented 1 year ago

The output difference should be gone again after: 9055661b958753d ...so I'll close this report. But as I wrote earlier, you may want to adjust test expectations for FP math with NaN values.

If I missed something, please re-open.

wjristow commented 1 year ago

Thanks for fixing this so quickly @rotateright.

Closing the loop on a couple of comments/questions above:

Regarding:

A specific NaN value is replaced with a canonical NaN value. We're not concerned about anything else in this report, are we?

Yes, it wasn't obvious at the start, but after analysis, that's what this ultimately came down to.

Regarding:

But also note that there is no requirement that we preserve a NaN payload in IEEE-754, so technically, there is no bug here. Ie, we give our best effort to not change NaN bits, but users/Alive2 can't rely on that behavior.

True that it isn't a requirement to preserve the payload, and so it's fair to say this is not a bug. But the IEEE standard (IEEE Std 754™-2008) in section 6.2.3 NaN propagation, starts by saying:

An operation that propagates a NaN operand to its result and has a single NaN as an input should produce a NaN with the payload of the input NaN if representable in the destination format.

Since it says that it should produce a NaN with the same payload, rather than must produce that NaN, then I agree that it's technically not a bug. And more directly, code shouldn't rely on it. So that argues that technically the test-case was "buggy". But that said, producing the same payload is certainly better than changing it.

rotateright commented 1 year ago

Thanks for summarizing, @wjristow !

I'd just add that LLVM default FP is not fully IEEE-754 compliant; we have "-ffp-exception-behavior" and other flags that try to model that environment. By default, we're trying to balance compliance and perf. And as you already know, we can tilt much more towards perf with fast-math-flags.

@dyung - for test programs that do FP math, it might be more valuable to init with FP numbers instead of semi-arbitrary byte values. That way, you can verify that a valid known result is produced all the way through codegen.

This was an easy fix, so no real controversy. But you might be interested in following progress on issue #59279 / https://reviews.llvm.org/D139785 - we're forced to make a trade-off on that one to reduce optimization power to maintain some kind of semantic soundness.

wjristow commented 1 year ago

Thanks @rotateright for pointing me at that other (more interesting, from a performance perspective) issue! And yes, control over the FP behavior in LLVM has improved dramatically over the last few years (e.g., with "-ffp-exception-behavior").

gregbedwell commented 1 year ago

@dyung - for test programs that do FP math, it might be more valuable to init with FP numbers instead of semi-arbitrary byte values. That way, you can verify that a valid known result is produced all the way through codegen.

This is my fault. It's a test generator I wrote a long time ago that has caught a pretty large number of (legit) bugs over the years. I think this is the first case of a potentially not-a-bug in a while. Coincidentally, we're just starting work on a new and much more advanced version so I will make sure to bake some better protections against just this sort of issue into the design. Thanks!

llvm / llvm-project

Bad codegen after 8adfa29706e (NaN constant folding weirdness) #59122