New assert: !(TSFlags & X86II::REX_W) && "REX.W requires 64bit mode."

zmodem commented 6 years ago


Bugzilla Link	36346
Version	trunk
OS	Linux
CC	@topperc,@RKSimon,@rotateright

Extended Description

We're seeing this in Chromium builds, see https://bugs.chromium.org/p/chromium/issues/detail?id=811241

Bisection points to this commit:

Author: lsaba Date: Sun Feb 11 01:34:12 2018 New Revision: 324835

URL: http://llvm.org/viewvc/llvm-project?rev=324835&view=rev Log: [X86] Reduce Store Forward Block issues in HW

If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load, This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory. A "store forward block" occurs in cases that a store cannot be forwarded to the load. The most typical case of store forward block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load. The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where "store forward block" is created by the compiler when lowering memcpy calls to a sequence of a load and a store.

The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies. breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.

llvmbot commented 6 years ago

Reproducer for additional case based on a case reported by Richard Smith:

struct S {
      int a;
      int b;
      int c;
      int d;
    };
    void test_overlap_1(char *A, int x) {
      struct S *s1 = (struct S *)(A - 16);
      *(int *)(A - 8) = 7;
     struct S *s3 = (struct S *)(A);
     *s3 = *s1;
     *(long *)(A - 9) = x;
     *(long *)(A - 16) = x;
     *(char *)(A - 1) = 0;
     struct S *s2 = (struct S *)(A + 16);
     *s2 = *s1;
   }

The code produces extra stores for overlapping areas, which looks something like this:

        movl    $7, -8(%rdi)
        movq    -16(%rdi), %rax
        movq    %rax, (%rdi)
        movl    -8(%rdi), %eax
        movl    %eax, 8(%rdi)
        movl    -4(%rdi), %eax
        movl    %eax, 12(%rdi)
        movslq  %esi, %rax
        movq    %rax, -9(%rdi)
        movq    %rax, -16(%rdi)
        movb    $0, -1(%rdi)
        movq    -16(%rdi), %rax
        movq    %rax, 16(%rdi)
        movb    -8(%rdi), %al
        movb    %al, 24(%rdi)
        movq    -9(%rdi), %rax
        movq    %rax, 23(%rdi)
        movl    -1(%rdi), %eax
        movl    %eax, 31(%rdi)
        movzwl  3(%rdi), %eax
        movw    %ax, 35(%rdi) //BAD STORE
        movb    5(%rdi), %al
        movb    %al, 37(%rdi) //BAD STORE
        movl    -8(%rdi), %eax
        movl    %eax, 24(%rdi)
        movzwl  -4(%rdi), %eax
        movw    %ax, 28(%rdi)
        movb    -2(%rdi), %al
        movb    %al, 30(%rdi)
        movb    -1(%rdi), %al
        movb    %al, 31(%rdi)
        retq

zmodem commented 6 years ago

In some of Chromium's 64-bit builds, we're seeing test failures that started around the same time frame as this change, for example: https://ci.chromium.org/buildbot/chromium.clang/ToTWin64/920

So besides the assert, I suspect this is causing miscompiles.

The miscompile is discussed on the commit thread on llvm-commits: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180212/524909.html

zmodem commented 6 years ago

In some of Chromium's 64-bit builds, we're seeing test failures that started around the same time frame as this change, for example: https://ci.chromium.org/buildbot/chromium.clang/ToTWin64/920

So besides the assert, I suspect this is causing miscompiles.

zmodem commented 6 years ago

creduced repro:

$ clang -cc1 -triple i686--linux-android -emit-obj -target-feature +sse2 -O2 -vectorize-slp -x c++ a.ii

(It works with other triples too, e.g. i686-pc-win32)

struct a {
  float b;
  float c;
};
struct d {
  float e;
};
void f(a[]);
void g(d &j) {
  a h[1];
  f(h);
  a i[3];
  f(i);
  i[2].c = j.e;
  h[1] = i[1];
  h[2] = i[2];
}

zmodem commented 6 years ago

I've reverted in r324887.

Will try to produce a reduced repro.

llvm / llvm-project

New assert: !(TSFlags & X86II::REX_W) && "REX.W requires 64bit mode." #35694

Extended Description

The pass currently only handles cases where memcpy is lowered to XMM/YMM registers, it tries to break the memcpy into smaller copies. breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.