rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Moving line "hash_vec.push(h)" slows code 1.6x #113504

Closed safinaskar closed 1 year ago

safinaskar commented 1 year ago

Merely moving the line hash_vec.push(h) from one place to another slows the code down 1.6x. This is very unfortunate, because it prevents me from converting my sequential code to rayon (I will explain why). Also, all of these problems appeared in a real production code base, as I will explain.

So, here are two code examples, which I will call UPLMUT (for lack of a better name):

// UPLMUT slow
pub fn main() {
    let mut hash_vec = vec![];
    for n in 0.. {
        let buf = vec![0u8; 4194304];
        let h;
        if let Some(item) = (if n < 100 { Some(buf) } else { None }) {
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            h = vec![0u8, 0u8];
        } else {
            break;
        }
        hash_vec.push(h);
    }
    std::hint::black_box(hash_vec);
}
// UPLMUT fast
pub fn main() {
    let mut hash_vec = vec![];
    for n in 0.. {
        let buf = vec![0u8; 4194304];
        let h;
        if let Some(item) = (if n < 100 { Some(buf) } else { None }) {
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            h = vec![0u8, 0u8];
            hash_vec.push(h);
        } else {
            break;
        }
    }
    std::hint::black_box(hash_vec);
}

The only difference is the placement of hash_vec.push(h). Yet "UPLMUT slow" is 1.6x slower than "UPLMUT fast". Here are the timings from repeated runs of "UPLMUT slow" followed by repeated runs of "UPLMUT fast" (using time -p):

real 0.28
user 0.19
sys 0.08
real 0.25
user 0.16
sys 0.08
real 0.28
user 0.17
sys 0.10
real 0.28
user 0.17
sys 0.11
real 0.28
user 0.18
sys 0.09
real 0.28
user 0.17
sys 0.10
real 0.28
user 0.18
sys 0.10
real 0.28
user 0.17
sys 0.11
real 0.28
user 0.17
sys 0.10
real 0.28
user 0.18
sys 0.10
real 0.28
user 0.19
sys 0.08
real 0.28
user 0.17
sys 0.10
real 0.28
user 0.19
sys 0.09
real 0.28
user 0.18
sys 0.09
real 0.28
user 0.16
sys 0.12
real 0.28
user 0.18
sys 0.09
real 0.29
user 0.19
sys 0.09
real 0.28
user 0.18
sys 0.09
real 0.28
user 0.18
sys 0.09
real 0.28
user 0.20
sys 0.08
real 0.28
user 0.17
sys 0.10
real 0.28
user 0.19
sys 0.08
real 0.24
user 0.14
sys 0.10
real 0.28
user 0.18
sys 0.09
real 0.28
user 0.19
sys 0.09
real 0.25
user 0.16
sys 0.08
real 0.27
user 0.18
sys 0.09
real 0.28
user 0.19
sys 0.09
real 0.27
user 0.15
sys 0.12
real 0.28
user 0.20
sys 0.08
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.22
user 0.21
sys 0.00
real 0.18
user 0.17
sys 0.00
real 0.21
user 0.20
sys 0.00
real 0.21
user 0.20
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.16
user 0.15
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.16
user 0.15
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.18
user 0.17
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.16
user 0.16
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.18
user 0.18
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.17
user 0.17
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.16
sys 0.00
real 0.17
user 0.17
sys 0.00

Test system is Debian Sid Linux x86_64 with rustc 1.72.0-nightly (f7ca9df69 2023-06-24) running in Docker container in Linux 5.10.0.

Here is godbolt with assembly diff: https://godbolt.org/z/orT9oPcsr .

You may say something like this: "Well, you just call malloc and free in a different order, so it is no surprise that performance differs." No! Look carefully at this code. These two sources perform the same algorithm, i.e. mere optimization should transform one into the other. So I am nearly sure there is a bug in rustc, not in malloc, and not anywhere else.

You may say: "This is an artificial example. Also, if hash_vec.push(h) changes performance, just put it in the proper place and everything will be fast." No! This appeared in my real code, as I will explain.

Here is a version which is slightly closer to my real code. I will call it YLAURR.

// YLAURR slow
fn main() {
    let mut hash_vec = vec![];
    (0..100).map(|_|vec![0u8; 4194304]).map(|item|{
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            vec![0u8, 0u8]
    }).for_each(|h|hash_vec.push(h));
    std::hint::black_box(hash_vec);
}
// YLAURR fast
fn main() {
    let mut hash_vec = vec![];
    (0..100).map(|_|vec![0u8; 4194304]).for_each(|item|{
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            hash_vec.push(vec![0u8, 0u8]);
    });
    std::hint::black_box(hash_vec);
}

In this case, splitting for_each(...) into map(...).for_each(...) makes the code significantly slower. You may again say that this is an artificial example, or you may say "just don't split for_each into map.for_each".

So, again, here is an example even closer to my real code. I will call it WWUUIJ.

[package]
name = "my-chunker"
version = "0.1.0"
edition = "2021"

[dependencies]
blake3 = "1.4.0"
hex = "0.4.3"
libc = "0.2.147"
rayon = "1.7.0"
zstd = "0.12.3"
// fast
use std::io::Read;

fn main() {
        let chunk_size: usize = 4194304;

        let mut hash_vec = vec![];
        let mut input = std::io::stdin().lock();
        std::iter::from_fn(|| {
            let mut buf = vec![0u8; chunk_size];
            if let Ok(()) = input.read_exact(&mut buf) {
                Some(buf)
            } else {
                None
            }
        }).for_each(|chunk| {
                        let raw_hash = blake3::hash(&chunk);
                        std::hint::black_box(zstd::bulk::compress(&chunk, -22).unwrap());
                        let x: Vec<u8> = raw_hash.as_bytes().to_vec();
                        hash_vec.push(x);
        });
        std::hint::black_box(hash_vec);
}
// slow
use std::io::Read;

fn main() {
        let chunk_size: usize = 4194304;

        let mut hash_vec = vec![];
        let mut input = std::io::stdin().lock();
        std::iter::from_fn(|| {
            let mut buf = vec![0u8; chunk_size];
            if let Ok(()) = input.read_exact(&mut buf) {
                Some(buf)
            } else {
                None
            }
        }).map(|chunk| {
                        let raw_hash = blake3::hash(&chunk);
                        std::hint::black_box(zstd::bulk::compress(&chunk, -22).unwrap());
                        let x: Vec<u8> = raw_hash.as_bytes().to_vec();
                        x
        }).for_each(|h| {
            hash_vec.push(h);
        });
        std::hint::black_box(hash_vec);
}

Pipe 2 GiB of /dev/urandom to the program's stdin while testing.

If the example still looks artificial, okay, here is the real story behind all this. I am trying to write my own replacement for borg and casync. I wrote about it here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1607612493 . My program supports two operations: "add" and "extract". The "add" operation splits the input file into chunks and then, for every chunk, compresses it, writes it to its own "chunk file", and also adds its hash to a (single) "index file". So the program's algorithm can be described with this pseudocode:

// Pseudo-code
let iter = /* Iterator, which produces chunks */
iter.for_each(|chunk| {
  // Compress chunk
  // Write compressed data to (its own) "chunk file"
  // Add hash to (single) "index file"
});

Now I decided to parallelize the program using rayon. As a first step, I split the above closure into two parts: one parallelizable and one sequential:

// Pseudo-code
let iter = /* Iterator, which produces chunks */
iter.map(|chunk| {
  // Compress chunk
  // Write compressed data to (its own) "chunk file"
  let hash = /* ... */
  return hash;
}).for_each(|h|{
  // Add hash to (single) "index file"
});

Unfortunately, this splitting surprisingly made the program slower! So the first step of converting the program to rayon (to make it faster!) made it slower. This is very bad. I started to investigate why, and reduced the problem to WWUUIJ, then to YLAURR, and finally to UPLMUT. So I think the core problem is in rustc, and I hope I have convinced you that this bug is entirely realistic.

safinaskar commented 1 year ago

@rustbot label +I-slow +T-compiler +WG-codegen +A-codegen

the8472 commented 1 year ago

I doubt that the rather minor assembly changes could result in such a huge difference. Godbolt might not be representative in this case, maybe due to a difference in optimization flags.

The first example spends time in the kernel while the second one doesn't. That could indicate that calls to the allocator were optimized out in one case (making this a benchmarking artifact) and not in the other.

Try running it under a profiler or objdumping the assembly and diffing those.

saethlin commented 1 year ago

I reduced the first example to this:

pub fn main() {
    let mut hash_vec = vec![];
    for n in 0..100 {
        let buf = vec![0u8; 4194304];
        for _ in 0..5 {
            std::hint::black_box(buf.clone());
        }
        let h = vec![0u8, 0u8];
        drop(buf);
        hash_vec.push(h);
    }   
    std::hint::black_box(hash_vec);
}

Moving the drop(buf) to after the push produces perf stat output that matches the fast version. The difference in drop order is hard to see in the original code because buf is moved into the if let, so its drop location is not easy to see in relation to the push.

Based on profiling, the difference seems attributable to a 44x (!!) difference in the number of page faults between the two implementations. If I swap in jemalloc or mimalloc, the difference in runtime and page faults goes away. So I strongly suspect that this code is generating some worst-case behavior in glibc's allocator.

If you have an allocation-intensive workload it is always good advice to use a non-default allocator. This effect is surprisingly dramatic, but I don't think there's anything for us to do here.

safinaskar commented 1 year ago

I added Vec::with_capacity to my actual code and now everything is fast. So the problem is solved for me!

Also, I redirected the bug to glibc (I rewrote @saethlin's code in C): https://sourceware.org/bugzilla/show_bug.cgi?id=30625 , so I am closing the bug here.