Open folkertdev opened 6 months ago
Perhaps this is not a bug. For your bad()
and good()
functions, when (dst == src)
, i.e. same array offset, you get undefined behavior with memcpy
.
https://stackoverflow.com/questions/43834781/behaviour-of-memcpy-when-pointers-are-identical
Perhaps the proper way to do it is this?
Foo *restrict dst_ptr = ptr + dst;
*dst_ptr = ptr[src];
The dst == src
case is specifically allowed for the LLVM memcpy intrinsic, the docs note:
the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.
if this were not the case, good
would actually be UB because src
and dst
can be the same there, too.
For many libc implementations it is UB when src == dst
, but llvm can work around that by just skipping the memcpy entirely when source and destination are the same.
Your suggestion does work (except that I think you need an explicit if (src != dst)
check for it to be sound), but I think LLVM should automatically infer that memcpy
can be used in this scenario.
@folkertdev
The
dst == src
case is specifically allowed for the LLVM memcpy intrinsic, the docs note:the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.
if this were not the case,
good
would actually be UB becausesrc
anddst
can be the same there, too.
From your Godbolt link, I didn't see good()
emit any call to memcpy(), but a lot of SSE XMM "mov" instructions. "mov" instructions allow dest and source to be the same, which act as no-op. But if LLVM emits a memcpy() there, I would argue that it's a bug.
LLVM memcpy() intrinsic allows dest and source pointers to be the same only when compile time. Your tests require runtime check of (dst != src)
, so LLVM emitting memmove() instead of memcpy() is a correct behavior. (Especially for "-Os" optimization.)
The OP's example should be changed to this to demonstrate the bug:
void bad2(Foo* ptr, int dst, int src) {
if (dst == src)
return;
Foo x = ptr[src];
ptr[dst] = x;
}
Here, memcpy
would be safe, because I explicitly filter out the (dst == src)
case.
on x86_64 the memcpy gets inlined. But if you look at the LLVM passes, the memcpy is preserved until the very end: it is never converted into a memmove.
On a 32-bit target without simd instructions, the problem is clearer. Consider https://godbolt.org/z/6c7Waq9x9:
In this case the emitted assembly is
bad:
push {r7, lr}
mov r7, sp
bl OUTLINED_FUNCTION_0
bl __aeabi_memmove
pop {r7, pc}
good:
str lr, [sp, #-8]!
bl OUTLINED_FUNCTION_0
ldr lr, [sp], #8
b __aeabi_memcpy
OUTLINED_FUNCTION_0:
add.w r1, r1, r1, lsl #1
adds r3, r0, r1
add.w r1, r2, r2, lsl #1
movs r2, #3
add r1, r0
mov r0, r3
bx lr
you can clearly see that good
calls memcpy
, but bad
calls memmove
. The LLVM passes confirm that LLVM continues to consider good
a memcpy until code is emitted.
I also believe that this invalidates your claim that the check for src != dst
needs to be made in the source code, otherwise good
s body of ptr[dst] = ptr[src]
would be UB! surely clang would not emit such code if it were UB?
My point is ultimately that good
and bad
should produce the same code. If memcpy is valid for good
, it should also be valid for bad
.
@folkertdev My position is that if your code is not portable at the start, you should fix your code first and not blame LLVM for missed optimization.
In this particular case, it's just lucky that LLVM can inline the memcpy
code for you, but technically this is a link time optimization that you shouldn't rely on.
If memcpy
was dynamically linked, LLVM (and other compilers) won't be able to assume it's safe to pass (dest == src)
to memcpy
at any time.
You should make the check of (dest == src)
in the source code.
I still don't see why it's valid then for clang to emit the memcpy for good
. Why does that not cause UB in the dynamic linking scenario?
But in any case, even with the if (dst == src) return;
this does not optimize which is disappointing. This check does not seem to be considered by the checks in MemCpyOptPass
when deciding between a memcpy and a memmove. Is it feasible to propagate that information?
also I think in general
the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.
really does suggest that the source and destination can be the same at runtime. It says nothing about being the same "at compile time" (by which you means something like memcpy(x, x, 1)
? i.e. source and destination are literally the same value?). So if source being equal to destination at runtime, is not allowed, I think the wording should be clarified.
also I think in general
the ‘llvm.memcpy.*’ intrinsics copy a block of memory from the source location to the destination location, which must either be equal or non-overlapping.
really does suggest that the source and destination can be the same at runtime. It says nothing about being the same "at compile time" (by which you means something like
memcpy(x, x, 1)
? i.e. source and destination are literally the same value?). So if source being equal to destination at runtime, is not allowed, I think the wording should be clarified.
No. The 'llvm.memcpy.*' intrinsic is nothing to do with your optimize case. It only matters when you invoke the intrinsic directly, and not when it is emitted as part of the optimization. Because ultimately it is libc's memcpy
that would be invoked in the generated code. The additional allowed behavior of 'llvm.memcpy' would become just a compile time "sugar".
Think of the memcpy
intrinsic like a wrapper function to libc's memcpy
...
inline void* memcpy_wrapper(void *dest, const void *src, size_t n) {
if (dest != src)
memcpy(dest, src, n);
}
... that allows the function to translate to no-op when needed. You might think that compiler might generate (dest != src)
runtime check, but this usually results in a larger code than simply a memmove
call (dynamically linked), thus the compiler might choose not to do it.
The compiler can't know your runtime constraints unless you tell it. For your use case in particular, the compiler doesn't know the (dest != src)
constraint, thus it's not a bug when it generates memmove
and not memcpy
as you wished.
If you change the opt-level to -Oz
then the compiler does generate a memcpy
on the good
function even though src
might be equal to dst
.
If I understand what's discussed here correctly, then either @folkertdev is right and LLVM misses an optimization or @Explorer09 is right and LLVM has a bug? I might not have understood something though
I think this is the code that makes the decision to change to a memmove: https://github.com/llvm/llvm-project/blob/96819daa3d095cf9f662e0229dc82eaaa25480e8/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L1161-L1174
What I think is weird is that this check is done here, when we optimize two memcpy
s into one.
This check isn't run on all memcpy
s, but only when we try to optimize this way.
Meanwhile all memmove
s are changed to memcpy
if they don't alias here: https://github.com/llvm/llvm-project/blob/96819daa3d095cf9f662e0229dc82eaaa25480e8/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L1788-L1810
If you change the opt-level to
-Oz
then the compiler does generate amemcpy
on thegood
function even thoughsrc
might be equal todst
.If I understand what's discussed here correctly, then either @folkertdev is right and LLVM misses an optimization or @Explorer09 is right and LLVM has a bug? I might not have understood something though
Just to say that I didn't understand the whole picture either. It could be a bug in LLVM when it generates memcpy
where src
may equal to dst
. However... it might be not a bug when LLVM can consider it safe to call memcpy
when it can analyze the whole function body (required for inlining). The UB of memcpy
is by the standard, but in the implementation level, it can be fine for src == dst
and such a memmove
-to-memcpy
optimization becomes a "feature". I personally think that the good()
function case is an accident – that using memcpy
could be safe for this particular libc implementation. But this is more of an exception than a rule.
Putting aside the whole "memcpy with src==dst" thing (LangRef is clear here, I think), what's required for this transform is proving that src != dst implies the source and destination don't partially overlap. Consider the following variant:
void no_optimize(char* ptr, int dst, int src) {
Foo x = *(Foo*)(ptr + src);
*(Foo*)(ptr + dst) = x;
}
This would be illegal to optimize to memcpy, so we need to distinguish between this case and the case you want to optimize. That should be possible, but I don't think we currently have any code which implements that sort of proof.
... what's required for this transform is proving that src != dst implies the source and destination don't partially overlap.
Thanks to @efriedma-quic . I can demonstrate the bug now.
void test_func(Foo *ptr, int dst, int src) {
*(Foo *)((char *)ptr + dst) = *(Foo *)((char *)ptr + src);
}
With "-Oz
" option this generates a call to memcpy
which is UB when two offsets partially overlap.
EDIT: Correction. It's still not a bug as it's UB on the C/C++ language, not a fault in LLVM's side. Still, a compile warning about UB would be better.
@efriedma-quic said:
so we need to distinguish between this case and the case you want to optimize. That should be possible, but I don't think we currently have any code which implements that sort of proof.
I think the vast majority of actual cases is covered by just considering (what in the source language was) array indexing, of the form myArray[index]
. In that case if we have a = getelementptr myArray x
and b = getelementptr myArray y
then these cannot overlap except if x == y
which is explicitly allowed in LLVM. I think this basic pattern covers a lot of what e.g. rust emits.
EDIT: Correction. It's still not a bug as it's UB on the C/C++ language, not a fault in LLVM's side. Still, a compile warning about UB would be better.
The static analyzer should be able to catch this sort of thing; if it doesn't, maybe worth filing a bug.
In that case if we have a = getelementptr myArray x and b = getelementptr myArray y then these cannot overlap except if x == y
That's the right idea. But we need to generalize this to also work with arithmetic, so we don't have to rewrite the code for ptradd (see https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/).
minimal example:
https://godbolt.org/z/baK6b13Ga
The
bad
function in this example emits a memmove instead of the more optimal memcpy.On x86_64 this is merely inefficient, but for embedded targets, using memmove means that memmove needs to be included in the binary. It can consitute a significant percentage of the total binary size, so we'd prefer not to emit it. memcpy is generally unavoidable, and therefore part of the binary already.
For x86_64 using a large type (127 bytes) is required for this issue to become clear, for smaller types the memmove is inlined. But on 32-bit targets without vector instructions, even a 3-byte value will get
memmove
d: https://godbolt.org/z/3vhK34bvo.The
good
function avoids this issue because the clang frontend emits a memcpy. In this example the memcpy is eventually inlined as a series of SSE instructions.I'd argue that in the
bad
function, because thesrc
anddst
pointers are bothgetelementptr
'd from the same pointer, the conditions of LLVM's memcpy are satisfied (the pointers don't alias OR src == dst), hence LLVM should in theory be able to emit a memcpy here.Recognizing this pattern is more important for languages like rust, where array access (by default) involves a bounds check.
rustc
is not currently capable of emitting just a single memcpy like clang.