Open MattPD opened 3 months ago
Author: Matt (MattPD)
Some notes on tracking this down and the observed differences between the two versions in terms of the loop vectorizer decisions (VF stands for vectorization factor, UF stands for unroll factor, reported as the interleaved count):
- v18.1.0: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize], and the comparison fails.
- v17.0.1: remark: vectorized loop (vectorization width: 2, interleaved count: 1) [-Rpass=loop-vectorize], and the execution terminates with Success!
Experimenting with overriding the VF,UF decisions of the loop vectorizer for v18.1.0:
- -mllvm -force-vector-width=2 still fails: https://godbolt.org/z/MqcfrvK3n
- -mllvm -force-vector-width=4 -mllvm -force-vector-interleave=1 fails as well
- -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=1 passes: https://godbolt.org/z/ecG4nKeds
In contrast, v17.0.1 passes with -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=2 as well as with -mllvm -force-vector-width=4 -mllvm -force-vector-interleave=2. Consequently, it encounters no miscompilation issues (vectorization-related or not). It also remains correct with -mllvm -force-vector-interleave=2 alone: https://godbolt.org/z/GMvfqdYW4
Thus, loop vectorization with VF=2,UF=2 works correctly for v17.0.1 (with the UF=2 being implicit rather than forced) but not for v18.1.0 (whether with -mllvm -force-vector-width=2 alone or with both -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=2).
Forcing VF=2,UF=1 using -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=1 results in correct execution for v18.1.0.
While it initially seemed that the higher UF is the incorrect choice for v18.1.0, this doesn't explain why the forced UF=2 isn't a problem for v17.0.1.
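To make the VF/UF terminology used above concrete, here is a minimal hand-written sketch of what a VF=2,UF=2 shape means for a simple element-wise loop. It is only an illustration of the terminology: the function names are hypothetical, the loop is not the one from this report, and real vectorizer output uses vector instructions rather than nested scalar loops.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: one element per iteration. */
void add_scalar(float *dst, const float *a, const float *b, size_t n) {
    assert(dst && a && b);
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Hypothetical analogue of VF=2, UF=2: each main-loop iteration covers
   UF interleaved "vectors" of VF lanes each, i.e. 4 elements. */
enum { VF = 2, UF = 2 };

void add_vf2_uf2(float *dst, const float *a, const float *b, size_t n) {
    assert(dst && a && b);
    size_t i = 0;
    for (; i + VF * UF <= n; i += VF * UF) {
        for (size_t u = 0; u < UF; ++u) {             /* interleaved copies */
            for (size_t lane = 0; lane < VF; ++lane) { /* vector lanes */
                size_t k = i + u * VF + lane;
                dst[k] = a[k] + b[k];
            }
        }
    }
    for (; i < n; ++i)                                 /* scalar remainder */
        dst[i] = a[i] + b[i];
}
```

For a loop without cross-iteration dependencies, both functions must produce the same results for any n. A mismatch that appears only for some VF,UF combinations is the kind of symptom being chased in this thread.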
Some notes on isolation:
Now, let's compare the LLVM IR between v17.0.1 with -mllvm -force-vector-interleave=2 and v18.1.0 with -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=2.
This way both versions make the same vectorization decision, vectorized loop (vectorization width: 2, interleaved count: 2), so we can minimize any spurious differences and focus on the salient ones alone.
Command lines used:
v17.0.1: -fopenmp -g0 -Ofast -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize -emit-llvm -mllvm -force-vector-interleave=2
v18.1.0: -fopenmp -g0 -Ofast -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize -emit-llvm -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=2
https://godbolt.org/z/Ej5E4vrca
LLVM IR diff (LHS: v17.0.1, RHS: v18.1.0): https://editor.mergely.com/0z3MCX6t
The only obvious difference is that v17.0.1 has two more instructions in the omp.inner.for.cond.preheader basic block:
%broadcast.splatinsert26 = insertelement <2 x float> poison, float %1, i64 0, !dbg !20
%broadcast.splat27 = shufflevector <2 x float> %broadcast.splatinsert26, <2 x float> poison, <2 x i32> zeroinitializer, !dbg !20
The remaining differences do not appear significant: vector.body and vector.body.1 look like minor scheduling differences that still satisfy the same RAW dependencies, AFAICT.
Both versions produce identical assembly:
https://godbolt.org/z/5oKMhWeT5
Comparing v18.1.0 VF=2,UF=2 vs. VF=2,UF=1
Recall that:
- -mllvm -force-vector-width=2 still fails
- -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=1 passes
The LLVM IR does seem to indicate unrolling alone (in particular, vector.body gets an extra load, add, GEP, and store):
https://editor.mergely.com/7U8bh2zu
The assembly is a bit more complicated (horizontal op, movlhps):
https://editor.mergely.com/rcumi22m
Unclear whether there's anything problematic in this stage, unless the unrolling decision is incorrect.
Miscompiled function(s)
Recall that:
- v18.1.0 with -mllvm -force-vector-width=2 fails: https://godbolt.org/z/MqcfrvK3n
- v17.0.1 with -mllvm -force-vector-interleave=2 passes: https://godbolt.org/z/GMvfqdYW4
As both of the above make the same loop vectorization decision (vectorized loop (vectorization width: 2, interleaved count: 2)), this is the closest available baseline for comparison.
When compiling with -fopenmp (where the failure for 18.1.0 is present), the assembly code for initialization_loop and omp_simd_loop is identical. It's only the comparison_loop (which does not use any OpenMP pragmas) that differs:
LHS=17, RHS=18: https://editor.mergely.com/AQ8BXfRM
However, compiling without -fopenmp results in "Success!" for LLVM 18.1.0, too.
FWIW, still getting 2,210 errors even with the relative comparison tolerance changed to scalar = 1000000.0 (from 1.0), so it doesn't seem like a minor numerical difference, either.
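As a rough illustration of why errors that survive a tolerance of 1000000.0 * FLT_EPSILON are unlikely to be rounding noise, here is a minimal sketch of a relative comparison. The helper name nearly_equal and the choice to scale by the larger magnitude are assumptions made for illustration. The comparison code in the actual repro may differ.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Absolute value without pulling in <math.h>. */
static float abs_f(float x) { return x < 0.0f ? -x : x; }

/* Hypothetical helper: true when a and b differ by at most
   scalar * FLT_EPSILON relative to the larger of the two magnitudes. */
bool nearly_equal(float a, float b, float scalar) {
    assert(scalar > 0.0f);
    float diff = abs_f(a - b);
    float scale = abs_f(a) > abs_f(b) ? abs_f(a) : abs_f(b);
    return diff <= scalar * FLT_EPSILON * scale;
}
```

With scalar = 1000000.0 this accepts relative differences of roughly 12%, so values that still fail such a check (for example 1.0 against 2.0) differ by far more than accumulated floating-point rounding would explain.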
In contrast, comparing LHS=18_without_fopenmp (passes) against RHS=18_with_openmp (fails):
https://editor.mergely.com/7i0QFyIx
This time only the assembly code for omp_simd_loop differs when comparing between LLVM 18.1.0 without -fopenmp (passing) and LLVM 18.1.0 with -fopenmp (failing).
But recall again that the assembly code for omp_simd_loop is exactly identical between LLVM 17.0.1 with -fopenmp (passing) and LLVM 18.1.0 with -fopenmp (failing).
Given all of the above, I still believe it's primarily an OpenMP issue on the grounds that it doesn't happen when compiling without -fopenmp, but seeing identical assembly code for omp_simd_loop (which is the only function using OpenMP pragmas) when comparing passing LLVM 17.0.1 vs. failing LLVM 18.1.0 is quite puzzling. It's quite possible that I've missed something, so feel free to take all of the analysis with a grain of salt and consider the original bug report comment alone.
This might be caused by front-end changes because libomp has been quite stable for a long time.
CC @kparzysz @alexey-bataev as you may be familiar with this part of the codebase
This may be a regression between LLVM version 17.0.1 and 18.1.0. The issue is still present in the main branch as of version 19.0.0 (dbc3e26c25587e5460ae12caed84cb09197c4ed7).
Consider the following loop:
We have that:
However, as of LLVM 18.1.0 when we:
- omp_simd_loop using #pragma omp simd and #pragma omp ordered simd
- comparison_loop (which is otherwise the same loop without any #pragma omp)
- 1000000.0 * FLT_EPSILON)
We have 12,090 errors for the code compiled with LLVM 18.1.0 but 0 errors for the code compiled with LLVM 17.0.1.
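For readers unfamiliar with the constructs involved, the following is a hypothetical sketch of how #pragma omp simd and #pragma omp ordered simd can be combined, together with a pragma-free reference loop. It is not the loop from this report: the function names, the arrays, and the indexed accumulation are assumptions chosen only to show the shape of such a test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop: the scaling may run in SIMD lanes, while the indexed
   accumulation can hit the same bin from different iterations and is
   therefore placed in an ordered simd region. */
void accumulate_simd(float *bins, float *scaled, const float *w,
                     const int *idx, int n) {
    assert(bins && scaled && w && idx);
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        scaled[i] = 2.0f * w[i];
        #pragma omp ordered simd
        {
            bins[idx[i]] += scaled[i];
        }
    }
}

/* Same loop without any OpenMP pragmas, used as the reference. */
void accumulate_reference(float *bins, float *scaled, const float *w,
                          const int *idx, int n) {
    assert(bins && scaled && w && idx);
    for (int i = 0; i < n; ++i) {
        scaled[i] = 2.0f * w[i];
        bins[idx[i]] += scaled[i];
    }
}
```

When built without -fopenmp the pragmas are ignored and both functions are the same sequential loop. With -fopenmp, a correct compiler should still produce results matching the reference, which is the property the repro checks.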
Compiler Explorer repro:
- LLVM 18.1.0: totalErrors_simd: 12090, FAIL: error in ordered simd computation.
- LLVM 17.0.1: Success!
The bug is only present when compiling with -fopenmp (compiling without -fopenmp makes LLVM 18.1.0 pass). Removing all #pragma omp also makes this pass. Using #pragma omp simd safelen(2) instead of #pragma omp simd is similarly sufficient, but this effectively makes #pragma omp ordered simd unnecessary.
The above would strongly indicate this is an OpenMP issue. However, when attempting to track this down, and in particular to analyze the interactions with different loop vectorizer decisions between LLVM 17.0.1 and 18.1.0, I've run into some "interesting" challenges (notes on the findings are in the next comment to keep this one short).
This may be related to an earlier bug (although note that this one is a bit simpler in that it doesn't use printf inside the loop, which currently prevents vectorization, so the earlier bug does not reproduce for me at the time of writing):
[OpenMP 4.5] ORDERED SIMD construct in loop SIMD doesn't work as required by the specification https://github.com/llvm/llvm-project/issues/51043
Full repro source code (for completeness only: the aforementioned Compiler Explorer repros are identical):