nickdesaulniers opened this issue 2 years ago
@llvm/issue-subscribers-backend-x86
The value "128" was chosen 16 years ago in https://github.com/llvm/llvm-project/commit/03c1e6f48e8e7a218a4db1b6ee455b79503c61fa . Maybe the correct default has changed since then. :)
It seems only icelake and later targets have fast rep movsb (FSRM): https://reviews.llvm.org/D85989. We already have patches for the replacement: https://reviews.llvm.org/D86883, https://godbolt.org/z/ovEf9Kxfz
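For reference, "fast rep" here is FSRM (Fast Short REP MOVSB). A minimal sketch of a `rep movsb`-based copy, purely as illustration and not code from those patches:

```cpp
#include <cstddef>

// rep movsb copy via GNU inline asm. On FSRM-capable CPUs (Ice Lake and
// later) this is competitive with library memcpy even at short lengths;
// on older CPUs the rep startup overhead dominates for small copies.
static inline void rep_movsb(void *dst, const void *src, std::size_t n) {
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(n)
               : // no other inputs
               : "memory");
}
```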
I've actually thought about a similar problem. Let me at least come up with a benchmark (-mllvm -x86-use-fsrm-for-memcpy will simplify that :))
Ok, got the benchmark-ish: benchmark_memcpy.cc.txt
On Zen3: for fully unaligned pointers, `memcpy` always wins (res-align1.txt); align 2 likewise (res-align2.txt). This concludes my interest.
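The attachment isn't reproduced here, but a minimal sketch of an alignment-sweeping harness of this kind (my own illustration; the actual benchmark_memcpy.cc may differ) could look like:

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
  constexpr int kIters = 1'000'000;
  std::vector<unsigned char> src(4096 + 64, 1), dst(4096 + 64);

  for (int offset : {1, 2}) {            // misalign both pointers (align 1, align 2)
    for (int len : {16, 128, 256, 4096}) {
      const unsigned char *s = src.data() + offset;
      unsigned char *d = dst.data() + offset;
      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < kIters; ++i) {
        std::memcpy(d, s, len);          // swap in a rep-movs variant to compare
        asm volatile("" ::: "memory");   // keep the copy from being hoisted
      }
      auto t1 = std::chrono::steady_clock::now();
      double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / kIters;
      std::printf("offset=%d len=%d: %.2f ns/copy\n", offset, len, ns);
    }
  }
  return 0;
}
```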
CC @RKSimon
llvm-project/llvm/lib/Target/X86/X86Subtarget.h, line 81 in a9b70a8:
/// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.
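For context, the declaration that comment documents reads roughly as follows (paraphrased from X86Subtarget.h; see the permalink above for the exact text in the pinned revision):

```cpp
/// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.
unsigned MaxInlineSizeThreshold = 128;
```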
I am a bit surprised that the `MaxInlineSizeThreshold` is actually 128, because my experiments indicate that it stops inlining the `memcpy` at 256 bytes: https://godbolt.org/z/j4qaTvjb7
In the above example, the `memcpy` is inlined even though one field is 16 * 16 = 256 bytes. Either uncommenting line 21 or changing the 16 to a 17 in line 39 makes it call `memcpy`.
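A sketch of the shape of that experiment (my reconstruction, not the actual godbolt source, so the line numbers above don't map to it):

```cpp
#include <cstring>

struct S {
  char buf[16 * 16];  // 256 bytes: the copy below is still inlined at -O2;
                      // growing this to 17 * 16 bytes makes clang emit a
                      // call to memcpy instead
};

void copy(S *dst, const S *src) {
  std::memcpy(dst, src, sizeof(S));
}
```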
Via this thread:
Consider the following example: compiled with -O2 -mno-sse (as the Linux kernel does), we get a call to `memcpy`; but if we reduce the number of members in `struct foo`, we can get an inline rep;movsq, which is going to be way faster. FWICT, it looks like isel is choosing whether to lower `@llvm.memcpy.p0i8.p0i8.i64()` to a libcall to `memcpy` vs. inlining a simple memcpy. I assume there's some limit on how many bytes rep;movsq can copy, but surely it's much larger than 16x8B?
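The code and assembly from that thread were not preserved above; a hypothetical reconstruction of the kind of example described (the struct name matches the quote, but the member names and counts are my assumptions), compiled with -O2 -mno-sse:

```cpp
#include <cstdint>
#include <cstring>

struct foo {
  std::uint64_t a, b, c, d, e, f, g, h,
                i, j, k, l, m, n, o, p;  // 16 * 8 = 128 bytes: inline rep;movsq
  // std::uint64_t q;                    // a 17th member (136 bytes) would
                                         // turn this into a call to memcpy
};

void copy_foo(foo *dst, const foo *src) {
  std::memcpy(dst, src, sizeof(*dst));
}
```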
cc @phoebewang