Closed safinaskar closed 1 year ago
@rustbot label +I-slow +T-compiler +WG-codegen +A-codegen
I doubt that the rather minor assembly changes could result in such a huge difference. Godbolt might not be representative in this case, maybe due to a difference in optimization flags.
The first example spends time in the kernel while the second one doesn't. That could indicate that calls to the allocator were optimized out in one case (making this a benchmarking artifact) and not in the other.
Try running it under a profiler or objdumping the assembly and diffing those.
I reduced the first example to this:
```rust
pub fn main() {
    let mut hash_vec = vec![];
    for n in 0..100 {
        let buf = vec![0u8; 4194304];
        for _ in 0..5 {
            std::hint::black_box(buf.clone());
        }
        let h = vec![0u8, 0u8];
        drop(buf);
        hash_vec.push(h);
    }
    std::hint::black_box(hash_vec);
}
```
Moving the `drop(buf)` to after the push produces `perf stat` output that matches the fast version. The difference in drop order is hard to see in the original code because `buf` is moved into the `if let`, so its drop location relative to the push is not obvious.
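For comparison, a sketch of the reordered ("fast") variant, factored into a function so the result can be inspected; the only change from the reduction above is that `drop(buf)` now comes after the push:

```rust
pub fn build_index() -> Vec<Vec<u8>> {
    let mut hash_vec = vec![];
    for _n in 0..100 {
        let buf = vec![0u8; 4194304];
        for _ in 0..5 {
            std::hint::black_box(buf.clone());
        }
        let h = vec![0u8, 0u8];
        hash_vec.push(h);
        // Moved after the push: perf stat now matches the fast version.
        drop(buf);
    }
    hash_vec
}

fn main() {
    std::hint::black_box(build_index());
}
```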
Based on profiling, the difference seems attributable to a 44x (!!) difference in the number of page faults between the two implementations. If I swap in jemalloc or mimalloc, the difference in runtime and page faults goes away. So I strongly suspect that this code is generating some worst-case behavior in glibc's allocator.
If you have an allocation-intensive workload it is always good advice to use a non-default allocator. This effect is surprisingly dramatic, but I don't think there's anything for us to do here.
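For reference, swapping in an alternative allocator is a two-line change. This sketch assumes the `mimalloc` crate (jemalloc via the `tikv-jemallocator` crate works the same way):

```rust
// Cargo.toml (assumed dependency):
// [dependencies]
// mimalloc = "0.1"

use mimalloc::MiMalloc;

// Route every Rust heap allocation through mimalloc instead of glibc malloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // ... allocation-heavy workload as before ...
}
```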
I added `Vec::with_capacity` to my actual code and now everything is fast. So the problem is solved for me!
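To illustrate why `Vec::with_capacity` helps (a minimal sketch; the sizes are made up): pre-allocating means the pushes never trigger a grow-and-copy cycle, so the backing buffer's address stays stable and the allocator is hit only once:

```rust
fn fill_preallocated(n: usize) -> Vec<Vec<u8>> {
    // Reserve space for all n hashes up front instead of growing on demand.
    let mut hash_vec: Vec<Vec<u8>> = Vec::with_capacity(n);
    let before = hash_vec.as_ptr();
    for _ in 0..n {
        hash_vec.push(vec![0u8, 0u8]);
    }
    // The pushes never reallocated the backing buffer.
    assert_eq!(before, hash_vec.as_ptr());
    hash_vec
}

fn main() {
    let v = fill_preallocated(100);
    assert!(v.capacity() >= 100);
}
```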
Also, I redirected the bug to glibc (I rewrote @saethlin's code in C): https://sourceware.org/bugzilla/show_bug.cgi?id=30625 , so I'm closing the bug here.
Merely moving the line `hash_vec.push(h)` from one place to another slows the code down 1.6x. This is very unfortunate, because it prevents me from converting my sequential code to rayon (I will explain why). Also, all these problems appeared in a real production code base, as I will explain.

So, here are two code examples, which I will call UPLMUT (for lack of better names):
The only difference is the placement of `hash_vec.push(h)`. Yet "UPLMUT slow" is 1.6 times slower than "UPLMUT fast". Here is the result of running "UPLMUT slow" 10 times and then "UPLMUT fast" 10 times (using `time -p`):

The test system is Debian Sid Linux x86_64 with `rustc 1.72.0-nightly (f7ca9df69 2023-06-24)` running in a Docker container on Linux 5.10.0. Here is a godbolt link with the assembly diff: https://godbolt.org/z/orT9oPcsr .
You may say something like: "well, you just call `malloc` and `free` in a different order, so it is no surprise that performance differs". No! Look carefully at this code. These two sources perform the same algorithm, i.e. mere optimization should transform one into the other. So I'm nearly sure there is a bug in rustc, not in `malloc`, and not in some other place.

You may say: "This is an artificial example. Also, if `hash_vec.push(h)` changes performance, just put it in the proper place and everything will be fast". No! This appeared in my real code, and I will explain. Here is a version slightly closer to my real code. I will call it YLAURR.
In this case, splitting `for_each(...)` into `map(...).for_each(...)` makes the code significantly slower. You may again say that this is an artificial example, or you may say "just don't split `for_each` into `map.for_each`".

So, again, here is an example even closer to my real code. I will call it WWUUIJ.
Pipe 2 GiB of /dev/urandom into the program's stdin while testing.
If the example still looks artificial, okay, here is the real story behind all this. I'm trying to write my own replacement for borg and casync. I wrote about it here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1607612493 . My program supports two operations: "add" and "extract". The "add" operation splits the input file into chunks, and then, for every chunk, compresses it, writes it to its own "chunk file", and also adds its hash to a (single) "index file". So the program's algorithm can be described with the following pseudocode:
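The pseudocode block did not survive extraction; as a rough single-threaded sketch of the flow just described (the chunking granularity and the compression and hash functions here are placeholders, not the author's actual code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Placeholder "compression": a real tool would call a codec such as zstd.
fn compress(chunk: &[u8]) -> Vec<u8> {
    chunk.to_vec()
}

// Placeholder hash: a real tool would use a cryptographic hash.
fn hash_chunk(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

fn add(input: &[u8], chunk_size: usize) -> Vec<u64> {
    let mut index = Vec::new(); // the (single) "index file"
    for chunk in input.chunks(chunk_size) {
        let compressed = compress(chunk);
        // Real code would write `compressed` to its own "chunk file".
        std::hint::black_box(compressed);
        index.push(hash_chunk(chunk)); // sequential append to the index
    }
    index
}

fn main() {
    let data = vec![0u8; 1 << 20];
    let index = add(&data, 4096);
    assert_eq!(index.len(), 256);
}
```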
Now I decided to parallelize the program using rayon. As a first step, I split the above closure into two parts: one parallelizable and one sequential:
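The split can be sketched sequentially like this (illustrative names, not the author's code; with rayon, the `map` stage over chunks would become a parallel iterator while the index append stays sequential):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn add_split(input: &[u8], chunk_size: usize) -> Vec<u64> {
    let mut index = Vec::new();
    input
        .chunks(chunk_size)
        // Parallelizable part: compress and hash each chunk independently.
        .map(|chunk| {
            let compressed = chunk.to_vec(); // placeholder for real compression
            std::hint::black_box(compressed);
            let mut h = DefaultHasher::new();
            chunk.hash(&mut h);
            h.finish()
        })
        // Sequential part: append each hash to the single shared index.
        .for_each(|hash| index.push(hash));
    index
}

fn main() {
    let index = add_split(&vec![1u8; 1 << 20], 4096);
    assert_eq!(index.len(), 256);
}
```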
Unfortunately, this splitting surprisingly made the program slower! So the first step of converting the program to rayon (to make it faster!) made it slower. This is very bad. I started to investigate why. I reduced the problem to code WWUUIJ, then to YLAURR, and finally to UPLMUT. So I think the core problem is in rustc, and I hope I have convinced you that this bug is entirely realistic.