llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

avoid libcall to memcpy harder #54535

Open nickdesaulniers opened 2 years ago

nickdesaulniers commented 2 years ago

Via this thread:

Consider the following example:

struct foo {
    unsigned long x0;
    unsigned long x1;
    unsigned long x2;
    unsigned long x3;
    unsigned long x4;
    unsigned long x5;
    unsigned long x6;
    unsigned long x7;
    unsigned long x8;
    unsigned long x9;
    unsigned long x10;
    unsigned long x11;
    unsigned long x12;
    unsigned long x13;
    unsigned long x14;
    unsigned long x15;
    // Comment out below members.
    unsigned long x16;
    unsigned long x17;
    unsigned long x18;
    unsigned long x19;
} *x, *y;

struct foo* get_x(void);

struct foo* cpy(struct foo *y) {
    struct foo *x = get_x();
    if (y != x)
        *x = *y;
    return x;
}

When compiled with -O2 -mno-sse (as the Linux kernel does), we get:

cpy:
  ...
        movl    $160, %edx
        movq    %rbx, %rdi
        movq    %r14, %rsi
        callq   memcpy@PLT
...

but if we reduce the number of members in struct foo, we can get:

cpy:
  ...
        movl    $16, %ecx
        movq    %rax, %rdi
        movq    %rbx, %rsi
        rep;movsq (%rsi), %es:(%rdi)
...

which is going to be way faster. FWICT, isel is choosing whether to lower @llvm.memcpy.p0i8.p0i8.i64() to a libcall to memcpy or to inline a simple memcpy.

I assume there's some limit on how many bytes rep;movsq can copy, but surely it's much larger than 16x8B?

cc @phoebewang

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

efriedma-quic commented 2 years ago

https://github.com/llvm/llvm-project/blob/a9b70a8b7b373fba340a2b6abf85032f028aac5c/llvm/lib/Target/X86/X86Subtarget.h#L81

The value "128" was chosen 16 years ago in https://github.com/llvm/llvm-project/commit/03c1e6f48e8e7a218a4db1b6ee455b79503c61fa . Maybe the correct default has changed since then. :)

phoebewang commented 2 years ago

It seems only Icelake and later targets have fast rep: https://reviews.llvm.org/D85989. We already have patches for the replacement: https://reviews.llvm.org/D86883, https://godbolt.org/z/ovEf9Kxfz

LebedevRI commented 2 years ago

I've actually thought about a similar problem. Let me at least come up with a benchmark (-mllvm -x86-use-fsrm-for-memcpy will simplify that :))

LebedevRI commented 2 years ago

Ok, got the benchmark-ish: benchmark_memcpy.cc.txt

On Zen3:

For fully unaligned pointers, memcpy always wins: res-align1.txt. This concludes my interest.

align 2 likewise: res-align2.txt

RKSimon commented 2 years ago

CC @RKSimon

bdaase commented 1 year ago

llvm-project/llvm/lib/Target/X86/X86Subtarget.h, line 81 in a9b70a8:

    /// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.

The value "128" was chosen 16 years ago in https://github.com/llvm/llvm-project/commit/03c1e6f48e8e7a218a4db1b6ee455b79503c61fa . Maybe the correct default has changed since then. :)

I am a bit surprised that the MaxInlineSizeThreshold is actually 128, because my experiments indicate that it stops inlining the memcpy at 256 bytes: https://godbolt.org/z/j4qaTvjb7

In the above example, the memcpy is inlined, even though one field is 16 * 16 = 256 bytes. Uncommenting either line 21 or changing the 16 to a 17 in line 39 makes it call memcpy.