[SLP] polyhedron/kepler tolerance failure

vzakhari commented 11 months ago

The benchmark started failing with Flang (-ffast-math) on Haswell after ac254fc055980219b30821c3717c6b7db0fbbc46 Reverting the change makes the test pass. The change enables vectorization for some reductions in the code, so there is a possibility that this is a test issue (though, the results are way off). I will try to create a minimal reproducer if I end up proving that this is an SLP issue.

@alexey-bataev please let me know if you know of any other reported issues with this commit or any additional change that I can try.

alexey-bataev commented 11 months ago

No other reports for now, please provide reproducer

vzakhari commented 11 months ago

I think I found the issue.

I cannot share full source code. Moreover, the upstream flang-new driver does not allow enabling SLP vectorizer easily, so the source code might not be that useful.

I was able to reduce the problem loop code to the following:

! Problem loop:
print *, '--', x_position_numerical
print *, '--', y_position_numerical
print *, '--', x_acceleration
print *, '--', y_acceleration
do while (time <= stop_time)
   time = time + 0.025_LONGreal
   numerical_radius_squared = x_position_numerical(11)**2 + y_position_numerical(11)**2
   x_acceleration(11) = -(0.01720209895_LONGreal**2)/numerical_radius_squared**1.5_LONGreal*x_position_numerical(11)
   y_acceleration(11) = -(0.01720209895_LONGreal**2)/numerical_radius_squared**1.5_LONGreal*y_position_numerical(11)

   x_alpha_sum = dot_product(alpha(0:11:1),x_position_numerical(0:11:1))
   x_beta_sum = dot_product(beta(0:11:1),x_acceleration(0:11:1))
   y_alpha_sum = dot_product(alpha(0:11:1),y_position_numerical(0:11:1))
   y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1))

   x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum)
   y_position_numerical(12) = (-y_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(y_beta_sum)

   x_gamma_sum = dot_product(gamma(0:12:1),x_acceleration(11:(-1):(-1)))
   y_gamma_sum = dot_product(gamma(0:12:1),y_acceleration(11:(-1):(-1)))

   x_dot_numerical = 40.0_LONGreal*(x_position_numerical(11)-x_position_numerical(10)) + (0.025_LONGreal/1743565824000.0_LONGreal)*(x_gamma_sum)
   y_dot_numerical = 40.0_LONGreal*(y_position_numerical(11)-y_position_numerical(10)) + (0.025_LONGreal/1743565824000.0_LONGreal)*(y_gamma_sum)

   do j = -1,11
      if (j >= 0) then
         x_position_numerical(j) = x_position_numerical(j+1)
         y_position_numerical(j) = y_position_numerical(j+1)
      endif
      if (j < 11) then
         x_acceleration(j) = x_acceleration(j+1)
         y_acceleration(j) = y_acceleration(j+1)
      endif
   end do
end do
print *, 'after loop'
print *, x_position_numerical
print *, x_dot_numerical, y_dot_numerical

The loop runs lots of iterations and most of the computations are interdependent. The problem appears as invalid x_position_numerical values after the loop. Even though the x_gamma_sum, y_gamma_sum, x_dot_numerical and y_dot_numerical computations are not used for x_position_numerical computation the tests pass if I remove them.

Some notes about the arrays used in the loop:

They all have 13 elements:

real (kind = LONGreal), dimension(0:12) :: x_position_numerical,y_position_numerical
real (kind = LONGreal), dimension(-1:11) :: x_acceleration,y_acceleration
real (kind = LONGreal), dimension(0:12) :: alpha, beta, gamma

alpha, beta and gamma arrays are initialized with constants before the loop, so optimizations before SLP are able to propagate the constants (so in the LLVM IR dumps we will see fmul's with constants expanded from dot_product intrinsics)

I was able to see the error after SLP when I tried reordering the code inside the loop, e.g. it fails with this ordering:

   x_alpha_sum = dot_product(alpha(0:11:1),x_position_numerical(0:11:1))
   x_beta_sum = dot_product(beta(0:11:1),x_acceleration(0:11:1))
   y_alpha_sum = dot_product(alpha(0:11:1),y_position_numerical(0:11:1))
   y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1))

   x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum)
   y_position_numerical(12) = (-y_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(y_beta_sum)

And it passes with this ordering:

   x_alpha_sum = dot_product(alpha(0:11:1),x_position_numerical(0:11:1))
   x_beta_sum = dot_product(beta(0:11:1),x_acceleration(0:11:1))

   ! The line hoisted here:
   x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum)

   y_alpha_sum = dot_product(alpha(0:11:1),y_position_numerical(0:11:1))
   y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1))

   y_position_numerical(12) = (-y_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(y_beta_sum)

I will refer to the first ordering as failord and to the second ordering as passord below. Comparing the SLP behavior for these two versions exposes the problem.

In the attached archive SLPdumps.zip you may find LLVM IR dumps before and after SLP for the failord and passord versions.

LLVM IR before SLP is almost the same except for the value numbering and the placement of the computation and the store for x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum).

I see incorrect LLVM IR generated by SLP for this piece of code (extract from passord.beforeslp.ll):

  %110 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
  %111 = fmul reassoc nnan ninf nsz arcp contract double %110, 0x4195B16D54000000
  %112 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 3), align 8, !tbaa !41
  %113 = fmul reassoc nnan ninf nsz arcp contract double %112, 0xC1AB5EBD4C000000
  %114 = fadd reassoc nnan ninf nsz arcp contract double %113, %111
  %115 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 4), align 16, !tbaa !41
  %116 = fmul reassoc nnan ninf nsz arcp contract double %115, 0x41C837DA70800000
  %117 = fadd reassoc nnan ninf nsz arcp contract double %116, %114
  %118 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
  %119 = fmul reassoc nnan ninf nsz arcp contract double %118, 0xC1D844630A000000
  %120 = fadd reassoc nnan ninf nsz arcp contract double %119, %117
  %121 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 6), align 16, !tbaa !41
  %122 = fmul reassoc nnan ninf nsz arcp contract double %121, 0x41E43A6599400000
  %123 = fadd reassoc nnan ninf nsz arcp contract double %122, %120
  %124 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 7), align 8, !tbaa !41
  %125 = fmul reassoc nnan ninf nsz arcp contract double %124, 0xC1E6A9F50C800000
  %126 = fadd reassoc nnan ninf nsz arcp contract double %125, %123
  %127 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 8), align 16, !tbaa !41
  %128 = fmul reassoc nnan ninf nsz arcp contract double %127, 0x41E43A6599400000
  %129 = fadd reassoc nnan ninf nsz arcp contract double %128, %126
  %130 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 9), align 8, !tbaa !41
  %131 = fmul reassoc nnan ninf nsz arcp contract double %130, 0xC1D844630A000000
  %132 = fadd reassoc nnan ninf nsz arcp contract double %131, %129
  %133 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 10), align 16, !tbaa !41
  %134 = fmul reassoc nnan ninf nsz arcp contract double %133, 0x41C837DA70800000
  %135 = fadd reassoc nnan ninf nsz arcp contract double %134, %132
  %136 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 11), align 8, !tbaa !41
  %137 = fmul reassoc nnan ninf nsz arcp contract double %136, 0xC1AB5EBD4C000000
  %138 = fadd reassoc nnan ninf nsz arcp contract double %137, %135
  %139 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 12), align 16, !tbaa !41
  %140 = fmul reassoc nnan ninf nsz arcp contract double %139, 0x4195B16D54000000
  %141 = fadd reassoc nnan ninf nsz arcp contract double %140, %138

This is a dot_product reduction for y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1)).

For the passord version SLP optimizes it like this (see passord.afterslp.ll):

  %74 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
  %75 = load <8 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
  %76 = extractelement <8 x double> %75, i32 0
...
  %98 = load <4 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 1), align 8
  %99 = shufflevector <4 x double> %98, <4 x double> poison, <8 x i32> <i32 poison, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %100 = shufflevector <8 x double> %99, <8 x double> %75, <8 x i32> <i32 poison, i32 1, i32 2, i32 poison, i32 9, i32 10, i32 poison, i32 poison>
  %101 = shufflevector <8 x double> %100, <8 x double> %75, <8 x i32> <i32 poison, i32 1, i32 2, i32 poison, i32 4, i32 5, i32 11, i32 12>
  %102 = insertelement <8 x double> %101, double %74, i32 0
  %103 = insertelement <8 x double> %102, double %76, i32 3
  %104 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %103, <double 0x4195B16D54000000, double 0xC1AB5EBD4C000000, double 0x41C837DA70800000, double 0xC1D844630A000000, double 0x41E43A6599400000, double 0xC1E6A9F50C800000, double 0x41E43A6599400000, double 0xC1D844630A000000>
  %105 = call reassoc nnan ninf nsz arcp contract double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %104)

Vector %103 holds elements <2, 3, 4, 5, 6, 7, 8, 9> of array _QFEy_acceleration (0-based indexing). The vector is then multiplied by the constants from beta and horizontally reduced.

For the failord version SLP generates the following (see failord.afterslp.ll):

  %74 = load <2 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
  %75 = load <8 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
  %76 = extractelement <8 x double> %75, i32 0
...
  %98 = load <4 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 1), align 8
  %99 = shufflevector <4 x double> %98, <4 x double> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
  %100 = shufflevector <8 x double> %99, <8 x double> %75, <8 x i32> <i32 3, i32 poison, i32 poison, i32 poison, i32 9, i32 10, i32 poison, i32 poison>
  %101 = shufflevector <8 x double> %100, <8 x double> %75, <8 x i32> <i32 0, i32 poison, i32 poison, i32 poison, i32 4, i32 5, i32 11, i32 12>
  %102 = shufflevector <2 x double> %74, <2 x double> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %103 = shufflevector <8 x double> %101, <8 x double> %102, <8 x i32> <i32 8, i32 9, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %104 = insertelement <8 x double> %103, double %76, i32 3
  %105 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %104, <double 0x4195B16D54000000, double 0xC1AB5EBD4C000000, double 0x41C837DA70800000, double 0xC1D844630A000000, double 0x41E43A6599400000, double 0xC1E6A9F50C800000, double 0x41E43A6599400000, double 0xC1D844630A000000>
  %106 = call reassoc nnan ninf nsz arcp contract double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %105)

Vector %104 holds the following elements <2, 3, poison, 5, 6, 7, 8, 9>. So the whole result of the following horizontal reduction is poison.

The issue may be reproduced with the following command (I ran it on an icelake machine): opt --passes="function(slp-vectorizer)" failord.beforeslp.ll -S --mcpu=icelake-server

vzakhari commented 11 months ago

@alexey-bataev please let me know if you have problems with reproducing the issue or if you need any additional information.

alexey-bataev commented 11 months ago

Hmm, I don't see the same result as you do. In your result you have %104 = insertelement <8 x double> %103, double %76, i32 3, but using the original open source compiler I'm getting %102 = insertelement <8 x double> %101, double %98, i32 2. What is your compiler version, is this original compiler or modified?

vzakhari commented 11 months ago

I was using an unmodified upstream clang/llvm. I think it was at some two-days old revision. Let me check it with the latest and get back to you.

vzakhari commented 11 months ago

Sorry, I did not check thoroughly the output produced by the opt command that I posted for the reproducer. I indeed see what you see with the latest opt.

I can reproduce the issue with the latest opt and without the --mcpu= option: opt --passes="function(slp-vectorizer)" failord.beforeslp.ll -S:

  %74 = load <2 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
  %75 = load <8 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
  %76 = extractelement <8 x double> %75, i32 0
...
  %98 = load <4 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 1), align 8
  %99 = shufflevector <4 x double> %98, <4 x double> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
  %100 = shufflevector <8 x double> %99, <8 x double> %75, <8 x i32> <i32 3, i32 poison, i32 poison, i32 poison, i32 9, i32 10, i32 poison, i32 poison>
  %101 = shufflevector <8 x double> %100, <8 x double> %75, <8 x i32> <i32 0, i32 poison, i32 poison, i32 poison, i32 4, i32 5, i32 11, i32 12>
  %102 = shufflevector <2 x double> %74, <2 x double> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %103 = shufflevector <8 x double> %101, <8 x double> %102, <8 x i32> <i32 8, i32 9, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %104 = insertelement <8 x double> %103, double %76, i32 3
  %105 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %104, <double 0x4195B16D54000000, double 0xC1AB5EBD4C000000, double 0x41C837DA70800000, double 0xC1D844630A000000, double 0x41E43A6599400000, double 0xC1E6A9F50C800000, double 0x41E43A6599400000, double 0xC1D844630A000000>
  %106 = call reassoc nnan ninf nsz arcp contract double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %105)

./bin/opt --version
LLVM (http://llvm.org/):
  LLVM version 18.0.0git
  DEBUG build with assertions.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: icelake-server

alexey-bataev commented 11 months ago

Ok, thanks, will check once again

alexey-bataev commented 11 months ago

Fixed in 95703642e3ef617275fd80b5316b05c5a09c6219

vzakhari commented 11 months ago

Thank you! It is fixed.

llvm / llvm-project

[SLP] polyhedron/kepler tolerance failure #72202