Closed vzakhari closed 11 months ago
No other reports for now, please provide reproducer
I think I found the issue.
I cannot share full source code. Moreover, the upstream flang-new
driver does not allow enabling SLP vectorizer easily, so the source code might not be that useful.
I was able to reduce the problem loop code to the following:
! Problem loop:
print *, '--', x_position_numerical
print *, '--', y_position_numerical
print *, '--', x_acceleration
print *, '--', y_acceleration
do while (time <= stop_time)
time = time + 0.025_LONGreal
numerical_radius_squared = x_position_numerical(11)**2 + y_position_numerical(11)**2
x_acceleration(11) = -(0.01720209895_LONGreal**2)/numerical_radius_squared**1.5_LONGreal*x_position_numerical(11)
y_acceleration(11) = -(0.01720209895_LONGreal**2)/numerical_radius_squared**1.5_LONGreal*y_position_numerical(11)
x_alpha_sum = dot_product(alpha(0:11:1),x_position_numerical(0:11:1))
x_beta_sum = dot_product(beta(0:11:1),x_acceleration(0:11:1))
y_alpha_sum = dot_product(alpha(0:11:1),y_position_numerical(0:11:1))
y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1))
x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum)
y_position_numerical(12) = (-y_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(y_beta_sum)
x_gamma_sum = dot_product(gamma(0:12:1),x_acceleration(11:(-1):(-1)))
y_gamma_sum = dot_product(gamma(0:12:1),y_acceleration(11:(-1):(-1)))
x_dot_numerical = 40.0_LONGreal*(x_position_numerical(11)-x_position_numerical(10)) + (0.025_LONGreal/1743565824000.0_LONGreal)*(x_gamma_sum)
y_dot_numerical = 40.0_LONGreal*(y_position_numerical(11)-y_position_numerical(10)) + (0.025_LONGreal/1743565824000.0_LONGreal)*(y_gamma_sum)
do j = -1,11
if (j >= 0) then
x_position_numerical(j) = x_position_numerical(j+1)
y_position_numerical(j) = y_position_numerical(j+1)
endif
if (j < 11) then
x_acceleration(j) = x_acceleration(j+1)
y_acceleration(j) = y_acceleration(j+1)
endif
end do
end do
print *, 'after loop'
print *, x_position_numerical
print *, x_dot_numerical, y_dot_numerical
The loop runs lots of iterations and most of the computations are interdependent. The problem appears as invalid x_position_numerical
values after the loop. Even though the x_gamma_sum
, y_gamma_sum
, x_dot_numerical
and y_dot_numerical
computations are not used for x_position_numerical
computation the tests pass if I remove them.
Some notes about the arrays used in the loop:
real (kind = LONGreal), dimension(0:12) :: x_position_numerical,y_position_numerical
real (kind = LONGreal), dimension(-1:11) :: x_acceleration,y_acceleration
real (kind = LONGreal), dimension(0:12) :: alpha, beta, gamma
alpha
, beta
and gamma
arrays are initialized with constants before the loop, so optimizations before SLP are able to propagate the constants (so in the LLVM IR dumps we will see fmul
's with constants expanded from dot_product
intrinsics)I was able to see the error after SLP when I tried reordering the code inside the loop, e.g. it fails with this ordering:
x_alpha_sum = dot_product(alpha(0:11:1),x_position_numerical(0:11:1))
x_beta_sum = dot_product(beta(0:11:1),x_acceleration(0:11:1))
y_alpha_sum = dot_product(alpha(0:11:1),y_position_numerical(0:11:1))
y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1))
x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum)
y_position_numerical(12) = (-y_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(y_beta_sum)
And it passes with this ordering:
x_alpha_sum = dot_product(alpha(0:11:1),x_position_numerical(0:11:1))
x_beta_sum = dot_product(beta(0:11:1),x_acceleration(0:11:1))
! The line hoisted here:
x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum)
y_alpha_sum = dot_product(alpha(0:11:1),y_position_numerical(0:11:1))
y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1))
y_position_numerical(12) = (-y_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(y_beta_sum)
I will refer to the first ordering as failord
and to the second ordering as passord
below. Comparing the SLP behavior for these two versions exposes the problem.
In the attached archive SLPdumps.zip you may find LLVM IR dumps before and after SLP for the failord
and passord
versions.
LLVM IR before SLP is almost the same except for the value numbering and the placement of the computation and the store for x_position_numerical(12) = (-x_alpha_sum) + (0.000625_LONGreal/53222400.0_LONGreal)*(x_beta_sum)
.
I see incorrect LLVM IR generated by SLP for this piece of code (extract from passord.beforeslp.ll
):
%110 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
%111 = fmul reassoc nnan ninf nsz arcp contract double %110, 0x4195B16D54000000
%112 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 3), align 8, !tbaa !41
%113 = fmul reassoc nnan ninf nsz arcp contract double %112, 0xC1AB5EBD4C000000
%114 = fadd reassoc nnan ninf nsz arcp contract double %113, %111
%115 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 4), align 16, !tbaa !41
%116 = fmul reassoc nnan ninf nsz arcp contract double %115, 0x41C837DA70800000
%117 = fadd reassoc nnan ninf nsz arcp contract double %116, %114
%118 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
%119 = fmul reassoc nnan ninf nsz arcp contract double %118, 0xC1D844630A000000
%120 = fadd reassoc nnan ninf nsz arcp contract double %119, %117
%121 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 6), align 16, !tbaa !41
%122 = fmul reassoc nnan ninf nsz arcp contract double %121, 0x41E43A6599400000
%123 = fadd reassoc nnan ninf nsz arcp contract double %122, %120
%124 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 7), align 8, !tbaa !41
%125 = fmul reassoc nnan ninf nsz arcp contract double %124, 0xC1E6A9F50C800000
%126 = fadd reassoc nnan ninf nsz arcp contract double %125, %123
%127 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 8), align 16, !tbaa !41
%128 = fmul reassoc nnan ninf nsz arcp contract double %127, 0x41E43A6599400000
%129 = fadd reassoc nnan ninf nsz arcp contract double %128, %126
%130 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 9), align 8, !tbaa !41
%131 = fmul reassoc nnan ninf nsz arcp contract double %130, 0xC1D844630A000000
%132 = fadd reassoc nnan ninf nsz arcp contract double %131, %129
%133 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 10), align 16, !tbaa !41
%134 = fmul reassoc nnan ninf nsz arcp contract double %133, 0x41C837DA70800000
%135 = fadd reassoc nnan ninf nsz arcp contract double %134, %132
%136 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 11), align 8, !tbaa !41
%137 = fmul reassoc nnan ninf nsz arcp contract double %136, 0xC1AB5EBD4C000000
%138 = fadd reassoc nnan ninf nsz arcp contract double %137, %135
%139 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 12), align 16, !tbaa !41
%140 = fmul reassoc nnan ninf nsz arcp contract double %139, 0x4195B16D54000000
%141 = fadd reassoc nnan ninf nsz arcp contract double %140, %138
This is a dot_product
reduction for y_beta_sum = dot_product(beta(0:11:1),y_acceleration(0:11:1))
.
For the passord
version SLP optimizes it like this (see passord.afterslp.ll
):
%74 = load double, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
%75 = load <8 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
%76 = extractelement <8 x double> %75, i32 0
...
%98 = load <4 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 1), align 8
%99 = shufflevector <4 x double> %98, <4 x double> poison, <8 x i32> <i32 poison, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%100 = shufflevector <8 x double> %99, <8 x double> %75, <8 x i32> <i32 poison, i32 1, i32 2, i32 poison, i32 9, i32 10, i32 poison, i32 poison>
%101 = shufflevector <8 x double> %100, <8 x double> %75, <8 x i32> <i32 poison, i32 1, i32 2, i32 poison, i32 4, i32 5, i32 11, i32 12>
%102 = insertelement <8 x double> %101, double %74, i32 0
%103 = insertelement <8 x double> %102, double %76, i32 3
%104 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %103, <double 0x4195B16D54000000, double 0xC1AB5EBD4C000000, double 0x41C837DA70800000, double 0xC1D844630A000000, double 0x41E43A6599400000, double 0xC1E6A9F50C800000, double 0x41E43A6599400000, double 0xC1D844630A000000>
%105 = call reassoc nnan ninf nsz arcp contract double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %104)
Vector %103
holds elements <2, 3, 4, 5, 6, 7, 8, 9>
of array _QFEy_acceleration
(0-based indexing). The vector is then multiplied by the constants from beta
and horizontally reduced.
For the failord
version SLP generates the following (see failord.afterslp.ll
):
%74 = load <2 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
%75 = load <8 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
%76 = extractelement <8 x double> %75, i32 0
...
%98 = load <4 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 1), align 8
%99 = shufflevector <4 x double> %98, <4 x double> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
%100 = shufflevector <8 x double> %99, <8 x double> %75, <8 x i32> <i32 3, i32 poison, i32 poison, i32 poison, i32 9, i32 10, i32 poison, i32 poison>
%101 = shufflevector <8 x double> %100, <8 x double> %75, <8 x i32> <i32 0, i32 poison, i32 poison, i32 poison, i32 4, i32 5, i32 11, i32 12>
%102 = shufflevector <2 x double> %74, <2 x double> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%103 = shufflevector <8 x double> %101, <8 x double> %102, <8 x i32> <i32 8, i32 9, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%104 = insertelement <8 x double> %103, double %76, i32 3
%105 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %104, <double 0x4195B16D54000000, double 0xC1AB5EBD4C000000, double 0x41C837DA70800000, double 0xC1D844630A000000, double 0x41E43A6599400000, double 0xC1E6A9F50C800000, double 0x41E43A6599400000, double 0xC1D844630A000000>
%106 = call reassoc nnan ninf nsz arcp contract double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %105)
Vector %104
holds the following elements <2, 3, poison, 5, 6, 7, 8, 9>
. So the whole result of the following horizontal reduction is poison.
The issue may be reproduced with the following command (I ran it on an icelake machine): opt --passes="function(slp-vectorizer)" failord.beforeslp.ll -S --mcpu=icelake-server
@alexey-bataev please let me know if you have problems with reproducing the issue or if you need any additional information.
Hmm, I don't see the same result as you do. In your result you have %104 = insertelement <8 x double> %103, double %76, i32 3, but using the original open source compiler I'm getting %102 = insertelement <8 x double> %101, double %98, i32 2. What is your compiler version, is this original compiler or modified?
I was using an unmodified upstream clang
/llvm
. I think it was at some two-days old revision. Let me check it with the latest and get back to you.
Sorry, I did not check thoroughly the output produced by the opt
command that I posted for the reproducer. I indeed see what you see with the latest opt
.
I can reproduce the issue with the latest opt
and without the --mcpu=
option: opt --passes="function(slp-vectorizer)" failord.beforeslp.ll -S
:
%74 = load <2 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 2), align 16, !tbaa !41
%75 = load <8 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 5), align 8, !tbaa !41
%76 = extractelement <8 x double> %75, i32 0
...
%98 = load <4 x double>, ptr getelementptr inbounds ([13 x double], ptr @_QFEy_acceleration, i64 0, i64 1), align 8
%99 = shufflevector <4 x double> %98, <4 x double> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
%100 = shufflevector <8 x double> %99, <8 x double> %75, <8 x i32> <i32 3, i32 poison, i32 poison, i32 poison, i32 9, i32 10, i32 poison, i32 poison>
%101 = shufflevector <8 x double> %100, <8 x double> %75, <8 x i32> <i32 0, i32 poison, i32 poison, i32 poison, i32 4, i32 5, i32 11, i32 12>
%102 = shufflevector <2 x double> %74, <2 x double> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%103 = shufflevector <8 x double> %101, <8 x double> %102, <8 x i32> <i32 8, i32 9, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%104 = insertelement <8 x double> %103, double %76, i32 3
%105 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %104, <double 0x4195B16D54000000, double 0xC1AB5EBD4C000000, double 0x41C837DA70800000, double 0xC1D844630A000000, double 0x41E43A6599400000, double 0xC1E6A9F50C800000, double 0x41E43A6599400000, double 0xC1D844630A000000>
%106 = call reassoc nnan ninf nsz arcp contract double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> %105)
./bin/opt --version
LLVM (http://llvm.org/):
LLVM version 18.0.0git
DEBUG build with assertions.
Default target: x86_64-unknown-linux-gnu
Host CPU: icelake-server
Ok, thanks, will check once again
Fixed in 95703642e3ef617275fd80b5316b05c5a09c6219
Thank you! It is fixed.
The benchmark started failing with Flang (-ffast-math) on Haswell after ac254fc055980219b30821c3717c6b7db0fbbc46 Reverting the change makes the test pass. The change enables vectorization for some reductions in the code, so there is a possibility that this is a test issue (though, the results are way off). I will try to create a minimal reproducer if I end up proving that this is an SLP issue.
@alexey-bataev please let me know if you know of any other reported issues with this commit or any additional change that I can try.