llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.33k stars 11.7k forks source link

[OpenMP][SIMD] `ordered` has no effect in a loop SIMD region as of LLVM 18.1.0 #95611

Open MattPD opened 3 months ago

MattPD commented 3 months ago

This may be a regression between LLVM version 17.0.1 and 18.1.0. The issue is still present in the main branch as of version 19.0.0 (dbc3e26c25587e5460ae12caed84cb09197c4ed7).

Consider the following loop:

#define ARRAY_SIZE 256

__attribute__((noinline)) void omp_simd_loop(float X[ARRAY_SIZE][ARRAY_SIZE]) {
    for (int r = 1; r < ARRAY_SIZE; ++r) {
        for (int c = 1; c < ARRAY_SIZE; ++c) {
#pragma omp simd
            for (int k = 2; k < ARRAY_SIZE; ++k) {
#pragma omp ordered simd
                X[r][k] = X[r][k - 2] + sinf((float)(r / c));
            }
        }
    }
}

We have that:

"2.13.8 ordered Construct: The ordered construct either specifies a structured block in a loop, simd, or loop SIMD region that will be executed in the order of the loop iterations, or it is a stand-alone directive that specifies cross-iteration dependences in a doacross loop nest. The ordered construct sequentializes and orders the execution of ordered regions while allowing code outside the region to run in parallel."

However, as of LLVM 18.1.0 when we:

We have 12,090 errors for the code compiled with LLVM 18.1.0 but 0 errors for the code compiled with LLVM 17.0.1.

Compiler Explorer repro:

The bug is only present when compiling with -fopenmp (compiling without -fopenmp makes LLVM 18.1.0 pass). Removing all #pragma omp also makes this pass. Using #pragma omp simd safelen(2) instead of #pragma omp simd is similarly sufficient: But this effectively makes #pragma omp ordered simd unnecessary. The above would strongly indicate this is an OpenMP issue. However, when attempting to track this down--and in particular analyze the interactions with different loop vectorizer decisions between LLVM 17.0.1 and 18.1.0--I've run into some "interesting" challenges (notes on the findings in the next comment to keep this one short).

This may be related to an earlier bug (although note that this one is a bit simpler in that it doesn't use printf inside the loop which currently prevents vectorization and thus does not reproduce for me at the time of writing):

[OpenMP 4.5] ORDERED SIMD construct in loop SIMD doesn't work as required by the specification https://github.com/llvm/llvm-project/issues/51043


Full repro source code (for completeness only: the aforementioned Compiler Explorer repros are identical):

#include <float.h>
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int compare_float(float x1, float x2, float scalar) {
    const float diff = fabsf(x1 - x2);
    x1 = fabsf(x1);
    x2 = fabsf(x2);
    const float l = (x2 > x1) ? x2 : x1;
    if (diff <= l * scalar * FLT_EPSILON)
        return 1;
    else
        return 0;
}

#define ARRAY_SIZE 256

__attribute__((noinline)) void initialization_loop(
    float X[ARRAY_SIZE][ARRAY_SIZE], float Y[ARRAY_SIZE][ARRAY_SIZE]) {
    const float max = 1000.0;
    srand(time(NULL));
    for (int r = 0; r < ARRAY_SIZE; r++) {
        for (int c = 0; c < ARRAY_SIZE; c++) {
            X[r][c] = ((float)rand() / (float)(RAND_MAX)) * max;
            Y[r][c] = X[r][c];
        }
    }
}

__attribute__((noinline)) void omp_simd_loop(float X[ARRAY_SIZE][ARRAY_SIZE]) {
    for (int r = 1; r < ARRAY_SIZE; ++r) {
        for (int c = 1; c < ARRAY_SIZE; ++c) {
#pragma omp simd
            for (int k = 2; k < ARRAY_SIZE; ++k) {
#pragma omp ordered simd
                X[r][k] = X[r][k - 2] + sinf((float)(r / c));
            }
        }
    }
}

__attribute__((noinline)) int comparison_loop(float X[ARRAY_SIZE][ARRAY_SIZE],
                                              float Y[ARRAY_SIZE][ARRAY_SIZE]) {
    int totalErrors_simd = 0;
    const float scalar = 1.0;
    for (int r = 1; r < ARRAY_SIZE; ++r) {
        for (int c = 1; c < ARRAY_SIZE; ++c) {
            for (int k = 2; k < ARRAY_SIZE; ++k) {
                Y[r][k] = Y[r][k - 2] + sinf((float)(r / c));
            }
        }
        // check row for simd update
        for (int k = 0; k < ARRAY_SIZE; ++k) {
            if (!compare_float(X[r][k], Y[r][k], scalar)) {
                ++totalErrors_simd;
            }
        }
    }
    return totalErrors_simd;
}

int main(void) {
    float X[ARRAY_SIZE][ARRAY_SIZE];
    float Y[ARRAY_SIZE][ARRAY_SIZE];

    initialization_loop(X, Y);
    omp_simd_loop(X);
    const int totalErrors_simd = comparison_loop(X, Y);

    if (totalErrors_simd) {
        fprintf(stdout, "totalErrors_simd: %d \n", totalErrors_simd);
        fprintf(stdout, "%s : %d - FAIL: error in ordered simd computation.\n",
                __FILE__, __LINE__);
    } else {
        fprintf(stdout, "Success!\n");
    }

    return totalErrors_simd;
}
llvmbot commented 3 months ago

@llvm/issue-subscribers-openmp

Author: Matt (MattPD)

This may be a regression between LLVM version 17.0.1 and 18.1.0. The issue is still present in the main branch as of version 19.0.0 (dbc3e26c25587e5460ae12caed84cb09197c4ed7). Consider the following loop: ```cpp #define ARRAY_SIZE 256 __attribute__((noinline)) void omp_simd_loop(float X[ARRAY_SIZE][ARRAY_SIZE]) { for (int r = 1; r < ARRAY_SIZE; ++r) { for (int c = 1; c < ARRAY_SIZE; ++c) { #pragma omp simd for (int k = 2; k < ARRAY_SIZE; ++k) { #pragma omp ordered simd X[r][k] = X[r][k - 2] + sinf((float)(r / c)); } } } } ``` We have that: > "2.13.8 ordered Construct: > The ordered construct either specifies a structured block in a loop, simd, or loop SIMD region that will be executed in the order of the loop iterations, or it is a stand-alone directive that specifies cross-iteration dependences in a doacross loop nest. The ordered construct sequentializes and orders the execution of ordered regions while allowing code outside the region to run in parallel." - OpenMP Application Programming Interface, Version 4.5, https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf However, as of LLVM 18.1.0 when we: - run the `omp_simd_loop` using `#pragma omp simd` and `#pragma omp ordered simd` - run the sequential `comparison_loop` (which is otherwise the same loop without any `#pragma omp`) - compare the results, counting the number of errors whenever the comparison fails (up to an including a rather large relative comparison tolerance of `1000000.0 * FLT_EPSILON`) We have 12,090 errors for the code compiled with LLVM 18.1.0 but 0 errors for the code compiled with LLVM 17.0.1. Compiler Explorer repro: - LLVM 18.1.0: https://godbolt.org/z/qbcecozbv - execution result: `totalErrors_simd: 12090`, `FAIL: error in ordered simd computation.` - LLVM 17.0.1: https://godbolt.org/z/oMv863fss - execution result: `Success!` The bug is only present when compiling with -fopenmp (compiling without -fopenmp makes LLVM 18.1.0 pass). Removing all `#pragma omp` also makes this pass. Using `#pragma omp simd safelen(2)` instead of `#pragma omp simd` is similarly sufficient: But this effectively makes `#pragma omp ordered simd` unnecessary. The above would strongly indicate this is an OpenMP issue. However, when attempting to track this down--and in particular analyze the interactions with different loop vectorizer decisions between LLVM 17.0.1 and 18.1.0--I've run into some "interesting" challenges (notes on the findings in the next comment to keep this one short). This may be related to an earlier bug (although note that this one is a bit simpler in that it doesn't use `printf` inside the loop which currently prevents vectorization and thus does not reproduce for me at the time of writing): [OpenMP 4.5] ORDERED SIMD construct in loop SIMD doesn't work as required by the specification https://github.com/llvm/llvm-project/issues/51043 --- Full repro source code (for completeness only: the aforementioned Compiler Explorer repros are identical): ```cpp #include <float.h> #include <math.h> #include <omp.h> #include <stdio.h> #include <stdlib.h> #include <time.h> int compare_float(float x1, float x2, float scalar) { const float diff = fabsf(x1 - x2); x1 = fabsf(x1); x2 = fabsf(x2); const float l = (x2 > x1) ? x2 : x1; if (diff <= l * scalar * FLT_EPSILON) return 1; else return 0; } #define ARRAY_SIZE 256 __attribute__((noinline)) void initialization_loop( float X[ARRAY_SIZE][ARRAY_SIZE], float Y[ARRAY_SIZE][ARRAY_SIZE]) { const float max = 1000.0; srand(time(NULL)); for (int r = 0; r < ARRAY_SIZE; r++) { for (int c = 0; c < ARRAY_SIZE; c++) { X[r][c] = ((float)rand() / (float)(RAND_MAX)) * max; Y[r][c] = X[r][c]; } } } __attribute__((noinline)) void omp_simd_loop(float X[ARRAY_SIZE][ARRAY_SIZE]) { for (int r = 1; r < ARRAY_SIZE; ++r) { for (int c = 1; c < ARRAY_SIZE; ++c) { #pragma omp simd for (int k = 2; k < ARRAY_SIZE; ++k) { #pragma omp ordered simd X[r][k] = X[r][k - 2] + sinf((float)(r / c)); } } } } __attribute__((noinline)) int comparison_loop(float X[ARRAY_SIZE][ARRAY_SIZE], float Y[ARRAY_SIZE][ARRAY_SIZE]) { int totalErrors_simd = 0; const float scalar = 1.0; for (int r = 1; r < ARRAY_SIZE; ++r) { for (int c = 1; c < ARRAY_SIZE; ++c) { for (int k = 2; k < ARRAY_SIZE; ++k) { Y[r][k] = Y[r][k - 2] + sinf((float)(r / c)); } } // check row for simd update for (int k = 0; k < ARRAY_SIZE; ++k) { if (!compare_float(X[r][k], Y[r][k], scalar)) { ++totalErrors_simd; } } } return totalErrors_simd; } int main(void) { float X[ARRAY_SIZE][ARRAY_SIZE]; float Y[ARRAY_SIZE][ARRAY_SIZE]; initialization_loop(X, Y); omp_simd_loop(X); const int totalErrors_simd = comparison_loop(X, Y); if (totalErrors_simd) { fprintf(stdout, "totalErrors_simd: %d \n", totalErrors_simd); fprintf(stdout, "%s : %d - FAIL: error in ordered simd computation.\n", __FILE__, __LINE__); } else { fprintf(stdout, "Success!\n"); } return totalErrors_simd; } ```
MattPD commented 3 months ago

Some notes on tracking this down and the observed differences between the two versions in terms of the loop vectorizer decisions (VF stands for vectorization factor, UF stands for unroll factor):

Experimenting with overriding VF,UF decisions of the loop vectorizer:

Thus, loop vectorization with VF=2,UF=2 works correctly for v17.0.1 (with the UF=2 being implicit rather than forced) but v18.1.0 doesn't (whether with -mllvm -force-vector-width=2 alone or with -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=2 both).

Forcing VF=2,UF=1 using -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=1 results in correct execution for v18.1.0.

While it initially seemed that the higher UF is the incorrect choice for v18.1.0, this doesn't explain why the forced UF=2 isn't a problem for v17.0.1.


Some notes on isolation:

Now, let's compare LLVM IR between v17.0.1 with -mllvm -force-vector-interleave=2 and v18.1.0 with -mllvm -force-vector-width=2 -mllvm -force-vector-interleave=2.

This way both versions are making the same vectorization decision, vectorized loop (vectorization width: 2, interleaved count: 2) so we can minimize any spurious differences and focus on the salient ones alone.

Command lines used:

https://godbolt.org/z/Ej5E4vrca

LLVM IR diff (LHS: v17.0.1, RHS: v18.1.0): https://editor.mergely.com/0z3MCX6t

The only obvious difference is that v17.0.1 has two more instructions in the omp.inner.for.cond.preheader basic block:

%broadcast.splatinsert26 = insertelement <2 x float> poison, float %1, i64 0, !dbg !20
%broadcast.splat27 = shufflevector <2 x float> %broadcast.splatinsert26, <2 x float> poison, <2 x i32> zeroinitializer, !dbg !20

The remaining differences do not appear significant:

Both versions produce identical assembly:

https://godbolt.org/z/5oKMhWeT5


Comparing v18.1.0 VF=2,UF=2 vs. VF=2,UF=1

Recall that:

LLVM IR does seem to indicate unrolling alone (in particular, vector.body gets an extra load, add, GEP, and store): https://editor.mergely.com/7U8bh2zu

There's a bit more complicated assembly (horizontal op, movlhps): https://editor.mergely.com/rcumi22m

Unclear whether there's anything problematic in this stage, unless the unrolling decision is incorrect.


Miscompiled function(s)

Recall that:

As both of the above make the same loop vectorization decision (vectorized loop (vectorization width: 2, interleaved count: 2)) this is the closest available baseline for comparison.

When compiling with -fopenmp (where the failure for 18.1.0 is present), the assembly code for initialization_loop and omp_simd_loop is identical: It's only the comparison_loop (which does not use any OpenMP pragmas) that differs:

LHS=17, RHS=18: https://editor.mergely.com/AQ8BXfRM

However, compiling without -fopenmp results in "Success!" for LLVM 18.1.0, too.

FWIW, still getting 2,210 errors even with the relative comparison tolerance changed to scalar = 1000000.0 (from 1.0) so it doesn't seem like a minor numerical difference, either.

In contrast, comparing LHS=18_without_fopenmp (passes) against RHS=18_with_openmp (fails):

https://editor.mergely.com/7i0QFyIx

This time only the assembly code for omp_simd_loop differs when comparing between LLVM 18.1.0 without -fopenmp (passing) and LLVM 18.1.0 with -fopenmp (failing).

But recall again that the assembly code for omp_simd_loop is exactly identical between LLVM 17.0.1 with -fopenmp (passing) and LLVM 18.1.0 with -fopenmp (failing).

Given all of the above, I still believe it's primarily an OpenMP issue on the grounds that it doesn't happen when compiling without -fopenmp but seeing identical assembly code for omp_simd_loop (which is the only function using OpenMP pragmas) when comparing passing LLVM 17.0.1 vs. failing LLVM 18.1.0 is quite a bit puzzling. It's quite possible that I've missed something so feel free to take all of the analysis with a grain of salt and consider the original bug report comment alone.

shiltian commented 2 months ago

This might be caused by front end changes because libomp has been quite stable for a long time.

MattPD commented 2 months ago

CC @kparzysz @alexey-bataev as you may be familiar with this part of the codebase