Open · saethlin opened 2 years ago
This smells like a duplicate of https://github.com/rust-lang/rust/issues/44099
@rustbot label +A-codegen +I-slow
I just got bitten by this suboptimal codegen on a production project, and made a detailed write-up with performance analysis for my case here.
In my experience, calling `memcpy` for small buffers is especially bad for Linux musl targets, which have a significantly less optimized `memcpy` implementation than glibc. If any sort of buffered I/O or small-slice copying is involved in the hot path of some piece of code (which potentially includes writing to any sort of in-memory buffer or `BufWriter`), runtime regressions of ~10% or more are to be expected purely because of `memcpy` implementation differences.
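As an illustration (my own sketch, not code from the write-up linked above), a hot loop that serializes small records through `Write` performs a few-byte copy on every iteration:

```rust
use std::io::Write;

// Each `write_all` here copies only 4 bytes, but it can still bottom out
// in a call to memcpy on targets where LLVM doesn't expand small copies
// inline.
fn encode(records: &[u32], out: &mut Vec<u8>) -> std::io::Result<()> {
    for r in records {
        out.write_all(&r.to_le_bytes())?;
    }
    Ok(())
}
```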
This is a fairly old problem that has been partially addressed with small-copy optimizations in the `Read` and `Write` implementations for slices, but even those optimizations fall back to the slow path for copies larger than one byte, and it'd be great to see some progress made on this.
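For reference, the one-byte special case reads roughly like this (a paraphrase of std's `Read` impl for `&[u8]`, not the exact source):

```rust
use std::cmp;

// Paraphrase of std's small-copy optimization in `Read for &[u8]`: a
// one-byte read skips `copy_from_slice` (and thus memcpy); anything
// larger still takes the memcpy path.
fn read_from_slice(src: &mut &[u8], buf: &mut [u8]) -> usize {
    let amt = cmp::min(buf.len(), src.len());
    let (a, b) = src.split_at(amt);
    if amt == 1 {
        buf[0] = a[0];
    } else {
        buf[..amt].copy_from_slice(a);
    }
    *src = b;
    amt
}
```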
comparing a fixed-size copy and an unknown-but-small-size copy, including llvm opt remarks: https://godbolt.org/z/MacG1Yrx3
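(the source behind that link isn't reproduced here; a guess at the shape of the two functions being compared, names and signatures mine:)

```rust
// fixed, compile-time-known length: loop-idiom can still form a memcpy
// intrinsic from this loop (the opt remark below shows it firing), which
// later passes then lower back to a handful of moves
pub fn small_copy_fixed_amount(src: &[u8; 4], dst: &mut [u8; 4]) {
    for i in 0..4 {
        dst[i] = src[i];
    }
}

// unknown (but small) length: the memcpy intrinsic formed here tends to
// survive all the way to a libc call
pub fn small_copy_unknown_amount(src: &[u8; 4], dst: &mut [u8], len: usize) {
    for i in 0..len {
        dst[i] = src[i];
    }
}
```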
even in the case of a fixed-size copy there's a `memcpy` for at least a moment:

```
note: <source>:16:9 loop-idiom (success): Formed a call to llvm.memcpy.p0.p0.i64() intrinsic from load and store instruction in _ZN7example23small_copy_fixed_amount17ha5560e32ccb9b385E function
```
so in the typical case we're pretty dependent on LLVM deciding to turn calls to the intrinsic memcpy back into loads and stores! that, in turn, seems to happen in one of three places, two of which require a constant-size `len` for the `memcpy`, and the third is fairly target-specific. that seems to be why this remains a `memcpy` through to a libc call. either:
- `InstCombine` sees that the copy is a constant 1/2/4/8 bytes and replaces it with a load/store then and there in LLVM IR (see the sketch after this list), or
- `SelectionDAG` sees the copy is smaller than some heuristic limit and replaces the memcpy with a series of loads/stores, which lots of Rust code generally benefits from, or
- `SelectionDAG` target-specific code generates something for the memcpy. for x86 targets this means that if the target supports FSRM or the copy is constant-size, `rep movs` can be profitable.
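(a rust-level sketch of which case catches which shape of copy, my own illustration assuming a generic x86-64 target:)

```rust
// constant 8 bytes: the InstCombine case applies and the copy becomes a
// single 64-bit load/store in LLVM IR
pub fn copy_exact_8(src: &[u8; 8], dst: &mut [u8; 8]) {
    *dst = *src;
}

// constant but odd size: InstCombine passes on it, but SelectionDAG's
// small-constant-copy expansion still emits plain moves
pub fn copy_exact_5(src: &[u8; 5], dst: &mut [u8; 5]) {
    *dst = *src;
}

// variable size: neither constant-size path applies, so without something
// like FSRM this stays a call to memcpy
pub fn copy_variable(src: &[u8], dst: &mut [u8]) {
    let n = src.len().min(dst.len());
    dst[..n].copy_from_slice(&src[..n]);
}
```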
and hey, fsrm yields just that: more godbolt. fsrm in particular is an interesting detail, because for processors supporting fast `rep mov` it's likely faster than a loop. this LKML patch referencing it includes a handy-dandy graph comparing a loop vs `rep mov` for copies up to 31 bytes long (assuming src and dest are not in cache; i'm curious how this holds up when they are).
that seems like an indication that the "best" thing to do here is improve LLVM to handle the memcpy better, rather than avoiding the memcpy in LLVM entirely.
(i recognize most of the above is more suitable for an LLVM issue, but the prevalence of the memcpy intrinsic in code that optimizes well was a surprise!)
a contributing factor to this being more noticeable in Rust is `noalias`. `loop-idiom` doesn't fire to transform load-store loops into memcpy if source and dest can alias. once you know that, it's very easy to replicate this with clang (clang godbolt), or "fix" this by using rust types that might alias (rustc godbolt).
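(a sketch of that aliasing point, my own example rather than the godbolt links above: swapping the slices for `Cell<u8>` slices, which are allowed to alias, keeps loop-idiom from forming the memcpy)

```rust
use std::cell::Cell;

// `&[u8; 4]` and `&mut [u8]` are noalias, so loop-idiom may rewrite this
// loop as a memcpy intrinsic
pub fn copy_noalias(src: &[u8; 4], dst: &mut [u8], len: usize) {
    for i in 0..len {
        dst[i] = src[i];
    }
}

// `Cell` slices may alias, so loop-idiom can't prove this loop is a
// memcpy and it survives as plain loads and stores
pub fn copy_may_alias(src: &[Cell<u8>; 4], dst: &[Cell<u8>], len: usize) {
    for i in 0..len {
        dst[i].set(src[i].get());
    }
}
```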
godbolt demo
I have a `[u8; 4]` and I want to copy an unknown number of bytes from it into a `&mut [u8]`. Based on benchmarking, on x86_64 at least it is much slower to call `memcpy` than it is to do a byte-by-byte copy. The fastest implementation of this pattern in Rust is this:
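(a minimal sketch of the byte-by-byte loop described, reconstructed with hypothetical names rather than taken from the issue:)

```rust
// Copy `len` bytes (at most 4) out of a fixed-size buffer one byte at a
// time, sidestepping the call to memcpy for tiny copies.
pub fn copy_small(src: &[u8; 4], dst: &mut [u8], len: usize) {
    for i in 0..len {
        dst[i] = src[i];
    }
}
```

That just doesn't seem right to me.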