`memmove` is emitted where `memcpy` is valid

folkertdev commented 6 months ago

minimal example:

#include "inttypes.h";

typedef struct Foo {
    uint8_t x1[127];
} Foo;

void bad(Foo* ptr, int dst, int src) {
    Foo x = ptr[src];
    ptr[dst] = x;
}

void good(Foo* ptr, int dst, int src) {
    ptr[dst] = ptr[src];
}

The bad function in this example emits a memmove instead of the more optimal memcpy.

bad:                                    # @bad
        movsxd  rcx, edx
        mov     rax, rcx
        shl     rax, 7
        sub     rax, rcx
        add     rax, rdi
        movsxd  rcx, esi
        mov     rdx, rcx
        shl     rdx, 7
        sub     rdx, rcx
        add     rdi, rdx
        mov     edx, 127
        mov     rsi, rax
        jmp     memmove@PLT

On x86_64 this is merely inefficient, but for embedded targets, using memmove means that memmove needs to be included in the binary. It can consitute a significant percentage of the total binary size, so we'd prefer not to emit it. memcpy is generally unavoidable, and therefore part of the binary already.

For x86_64 using a large type (127 bytes) is required for this issue to become clear, for smaller types the memmove is inlined. But on 32-bit targets without vector instructions, even a 3-byte value will get memmoved: https://godbolt.org/z/3vhK34bvo.

The good function avoids this issue because the clang frontend emits a memcpy. In this example the memcpy is eventually inlined as a series of SSE instructions.

I'd argue that in the bad function, because the src and dst pointers are both getelementptr'd from the same pointer, the conditions of LLVM's memcpy are satisfied (the pointers don't alias OR src == dst), hence LLVM should in theory be able to emit a memcpy here.

Recognizing this pattern is more important for languages like rust, where array access (by default) involves a bounds check. rustc is not currently capable of emitting just a single memcpy like clang.

Explorer09 commented 6 months ago

Perhaps this is not a bug. For your bad() and good() functions, when (dst == src), i.e. same array offset, you get undefined behavior with memcpy.

https://stackoverflow.com/questions/43834781/behaviour-of-memcpy-when-pointers-are-identical

Perhaps the proper way to do it is this?

Foo *restrict dst_ptr = ptr + dst;
*dst_ptr = ptr[src];

folkertdev commented 6 months ago

The dst == src case is specifically allowed for the LLVM memcpy intrinsic, the docs note:

the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.

if this were not the case, good would actually be UB because src and dst can be the same there, too.

For many libc implementations it is UB when src == dst, but llvm can work around that by just skipping the memcpy entirely when source and destination are the same.

Your suggestion does work (except that I think you need an explicit if (src != dst) check for it to be sound), but I think LLVM should automatically infer that memcpy can be used in this scenario.

Explorer09 commented 6 months ago

@folkertdev

The dst == src case is specifically allowed for the LLVM memcpy intrinsic, the docs note:

the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.

if this were not the case, good would actually be UB because src and dst can be the same there, too.

From your Godbolt link, I didn't see good() emit any call to memcpy(), but a lot of SSE XMM "mov" instructions. "mov" instructions allow dest and source to be the same, which act as no-op. But if LLVM emits a memcpy() there, I would argue that it's a bug.

LLVM memcpy() intrinsic allows dest and source pointers to be the same only when compile time. Your tests require runtime check of (dst != src), so LLVM emitting memmove() instead of memcpy() is a correct behavior. (Especially for "-Os" optimization.)

Explorer09 commented 6 months ago

The OP's example should be changed to this to demonstrate the bug:

void bad2(Foo* ptr, int dst, int src) {
    if (dst == src)
        return;
    Foo x = ptr[src];
    ptr[dst] = x;
}

Here, memcpy would be safe, because I explicitly filter out the (dst == src) case.

folkertdev commented 6 months ago

on x86_64 the memcpy gets inlined. But if you look at the LLVM passes, the memcpy is preserved until the very end: it is never converted into a memmove.

On a 32-bit target without simd instructions, the problem is clearer. Consider https://godbolt.org/z/6c7Waq9x9:

In this case the emitted assembly is

bad:
        push    {r7, lr}
        mov     r7, sp
        bl      OUTLINED_FUNCTION_0
        bl      __aeabi_memmove
        pop     {r7, pc}
good:
        str     lr, [sp, #-8]!
        bl      OUTLINED_FUNCTION_0
        ldr     lr, [sp], #8
        b       __aeabi_memcpy
OUTLINED_FUNCTION_0:
        add.w   r1, r1, r1, lsl #1
        adds    r3, r0, r1
        add.w   r1, r2, r2, lsl #1
        movs    r2, #3
        add     r1, r0
        mov     r0, r3
        bx      lr

you can clearly see that good calls memcpy, but bad calls memmove. The LLVM passes confirm that LLVM continues to consider good a memcpy until code is emitted.

I also believe that this invalidates your claim that the check for src != dst needs to be made in the source code, otherwise goods body of ptr[dst] = ptr[src] would be UB! surely clang would not emit such code if it were UB?

My point is ultimately that good and bad should produce the same code. If memcpy is valid for good, it should also be valid for bad.

Explorer09 commented 6 months ago

@folkertdev My position is that if your code is not portable at the start, you should fix your code first and not blame LLVM for missed optimization.

In this particular case, it's just lucky that LLVM can inline the memcpy code for you, but technically this is a link time optimization that you shouldn't rely on.

If memcpy was dynamically linked, LLVM (and other compilers) won't be able to assume it's safe to pass (dest == src) to memcpy at any time.

You should make the check of (dest == src) in the source code.

folkertdev commented 6 months ago

I still don't see why it's valid then for clang to emit the memcpy for good. Why does that not cause UB in the dynamic linking scenario?

But in any case, even with the if (dst == src) return; this does not optimize which is disappointing. This check does not seem to be considered by the checks in MemCpyOptPass when deciding between a memcpy and a memmove. Is it feasible to propagate that information?

folkertdev commented 6 months ago

also I think in general

the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.

really does suggest that the source and destination can be the same at runtime. It says nothing about being the same "at compile time" (by which you means something like memcpy(x, x, 1)? i.e. source and destination are literally the same value?). So if source being equal to destination at runtime, is not allowed, I think the wording should be clarified.

Explorer09 commented 6 months ago

also I think in general

the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.

really does suggest that the source and destination can be the same at runtime. It says nothing about being the same "at compile time" (by which you means something like memcpy(x, x, 1)? i.e. source and destination are literally the same value?). So if source being equal to destination at runtime, is not allowed, I think the wording should be clarified.

No. The 'llvm.memcpy.*' intrinsic is nothing to do with your optimize case. It only matters when you invoke the intrinsic directly, and not when it is emitted as part of the optimization. Because ultimately it is libc's memcpy that would be invoked in the generated code. The additional allowed behavior of 'llvm.memcpy' would become just a compile time "sugar".

Think of the memcpy intrinsic like a wrapper function to libc's memcpy...

inline void* memcpy_wrapper(void *dest, const void *src, size_t n) {
  if (dest != src)
    memcpy(dest, src, n);
}

... that allows the function to translate to no-op when needed. You might think that compiler might generate (dest != src) runtime check, but this usually results in a larger code than simply a memmove call (dynamically linked), thus the compiler might choose not to do it.

The compiler can't know your runtime constraints unless you tell it. For your use case in particular, the compiler doesn't know the (dest != src) constraint, thus it's not a bug when it generates memmove and not memcpy as you wished.

diondokter commented 6 months ago

If you change the opt-level to -Oz then the compiler does generate a memcpy on the good function even though src might be equal to dst.

If I understand what's discussed here correctly, then either @folkertdev is right and LLVM misses an optimization or @Explorer09 is right and LLVM has a bug? I might not have understood something though

diondokter commented 6 months ago

I think this is the code that makes the decision to change to a memmove: https://github.com/llvm/llvm-project/blob/96819daa3d095cf9f662e0229dc82eaaa25480e8/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L1161-L1174

What I think is weird is that this check is done here, when we optimize two memcpys into one. This check isn't run on all memcpys, but only when we try to optimize this way.

Meanwhile all memmoves are changed to memcpy if they don't alias here: https://github.com/llvm/llvm-project/blob/96819daa3d095cf9f662e0229dc82eaaa25480e8/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L1788-L1810

Explorer09 commented 6 months ago

If you change the opt-level to -Oz then the compiler does generate a memcpy on the good function even though src might be equal to dst.

If I understand what's discussed here correctly, then either @folkertdev is right and LLVM misses an optimization or @Explorer09 is right and LLVM has a bug? I might not have understood something though

Just to say that I didn't understand the whole picture either. It could be a bug in LLVM when it generates memcpy where src may equal to dst. However... it might be not a bug when LLVM can consider it safe to call memcpy when it can analyze the whole function body (required for inlining). The UB of memcpy is by the standard, but in the implementation level, it can be fine for src == dst and such a memmove-to-memcpy optimization becomes a "feature". I personally think that the good() function case is an accident – that using memcpy could be safe for this particular libc implementation. But this is more of an exception than a rule.

efriedma-quic commented 6 months ago

Putting aside the whole "memcpy with src==dst" thing (LangRef is clear here, I think), what's required for this transform is proving that src != dst implies the source and destination don't partially overlap. Consider the following variant:

void no_optimize(char* ptr, int dst, int src) {
    Foo x = *(Foo*)(ptr + src);
    *(Foo*)(ptr + dst) = x;
}

This would be illegal to optimize to memcpy, so we need to distinguish between this case and the case you want to optimize. That should be possible, but I don't think we currently have any code which implements that sort of proof.

Explorer09 commented 6 months ago

... what's required for this transform is proving that src != dst implies the source and destination don't partially overlap.

Thanks to @efriedma-quic . I can demonstrate the bug now.

void test_func(Foo *ptr, int dst, int src) {
  *(Foo *)((char *)ptr + dst) = *(Foo *)((char *)ptr + src);
}

With "-Oz" option this generates a call to memcpy which is UB when two offsets partially overlap.

EDIT: Correction. It's still not a bug as it's UB on the C/C++ language, not a fault in LLVM's side. Still, a compile warning about UB would be better.

folkertdev commented 6 months ago

@efriedma-quic said:

so we need to distinguish between this case and the case you want to optimize. That should be possible, but I don't think we currently have any code which implements that sort of proof.

I think the vast majority of actual cases is covered by just considering (what in the source language was) array indexing, of the form myArray[index]. In that case if we have a = getelementptr myArray x and b = getelementptr myArray y then these cannot overlap except if x == y which is explicitly allowed in LLVM. I think this basic pattern covers a lot of what e.g. rust emits.

efriedma-quic commented 6 months ago

EDIT: Correction. It's still not a bug as it's UB on the C/C++ language, not a fault in LLVM's side. Still, a compile warning about UB would be better.

The static analyzer should be able to catch this sort of thing; if it doesn't, maybe worth filing a bug.

In that case if we have a = getelementptr myArray x and b = getelementptr myArray y then these cannot overlap except if x == y

That's the right idea. But we need to generalize this to also work with arithmetic, so we don't have to rewrite the code for ptradd (see https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/).

llvm / llvm-project

`memmove` is emitted where `memcpy` is valid #86499