rust-lang / rust


std::io::copy performance may be 20x slower than it could be? #49921

Closed spease closed 3 years ago

spease commented 6 years ago

I was using std::io::copy to copy a file from an SSD into memory, with the reader wrapped in a ProgressBarReader, which reported an eyebrow-raising 30 MB/s. Wrapping it in a BufReader (either as the outer or middle layer) made no real difference.

So I googled around a bit and found the recommended size was 64KB according to some random Microsoft article, so I decided to copy-paste the std::io::copy implementation and increase the buffer size. This got up to 600 MB/s at about 256KB which sounded a lot more reasonable...then it continued going upwards from there past the point of believability (unless the files were being pulled from RAM as a consequence of my repeated accesses). With a 10MB Vec, I got 228.36 GB/s 🙃.

Unrealistic values aside, has anybody checked the buffer value to make sure there isn't some huge low-hanging fruit for std::io::copy speeds?

the8472 commented 6 years ago

You mention windows. This may be platform-specific behavior. On unix systems I would fadvise POSIX_FADV_SEQUENTIAL on the input file for copying. Googling tells me that the windows equivalent is FILE_FLAG_SEQUENTIAL_SCAN. Have you tried that?
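For reference, on Linux the hint described here can be applied with posix_fadvise. A minimal sketch, assuming the Linux ABI (the function is declared by hand here to avoid depending on the libc crate, and the POSIX_FADV_SEQUENTIAL value of 2 is the Linux constant):

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Hand-declared to avoid a dependency on the `libc` crate.
extern "C" {
    fn posix_fadvise(fd: i32, offset: i64, len: i64, advice: i32) -> i32;
}
// Assumption: Linux value of POSIX_FADV_SEQUENTIAL.
const POSIX_FADV_SEQUENTIAL: i32 = 2;

/// Hint the kernel that `file` will be read sequentially.
fn advise_sequential(file: &File) -> io::Result<()> {
    // len = 0 means "advise to the end of the file".
    let ret = unsafe { posix_fadvise(file.as_raw_fd(), 0, 0, POSIX_FADV_SEQUENTIAL) };
    if ret == 0 {
        Ok(())
    } else {
        // posix_fadvise returns the error number directly rather than setting errno.
        Err(io::Error::from_raw_os_error(ret))
    }
}
```

On success the kernel may increase read-ahead for the file, which mainly helps large sequential copies like the one in this issue.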

Voultapher commented 6 years ago

After a coarse overview, my intuition is that there is some funny business going on with the length determined by seek and the subsequent inc calls on the ProgressBar, which could explain the unrealistic results.

spease commented 6 years ago

It’s possible. The actual times did increase dramatically from the std::io::copy default though - my suspicion is that there is probably some caching to RAM going on or something as cold runs were generally slower than hot runs.

This was for some work code that I'm no longer actively developing. I'll leave this email marked as unread for a while in case I have some spare time to write a benchmark.

I don't think the other flag that was mentioned is accessible with the default file-open API, but in any case it at least appears there's some low-hanging fruit w.r.t. std::io::copy performance.

ljedrz commented 6 years ago

I tried benchmarking io::copy on Windoze before and after changing DEFAULT_BUF_SIZE to 64KB (instead of the original 8KB) and the results were slightly worse (up to ~3%) with the modified setup, both on HDD and SSD. However, I'm not sure if my benchmarks were good enough - I was having a hard time tweaking them so that I could see any difference between results for different file sizes.

FWIW, I'm providing the benchmark code below; it can easily be expanded to different file sizes:

use std::fs::File;
use std::io::{self, Seek, SeekFrom};
use test::Bencher;

#[bench]
fn bench_copy_4MB(b: &mut Bencher) {
    let mut file = test::black_box(File::open("4MB").unwrap());
    let mut writer: Vec<u8> = Vec::with_capacity(4 * 1024 * 1024);

    b.iter(|| {
        let mut x = Ok(0);
        for _ in 0..100 {
            // Rewind before each copy; otherwise every pass after the
            // first starts at EOF and copies nothing.
            file.seek(SeekFrom::Start(0)).unwrap();
            x = io::copy(&mut file, &mut writer);
            writer.clear();
        }
        x
    })
}

spease commented 6 years ago

4MB isn’t a whole lot. That’s well within the size of hard drive caches or possibly an OS memory-based cache. I was noticing the difference with files several gigabytes in size.

As a sanity check, maybe look at the actual times, not just the % difference between timings with different buffer sizes. The measured time may be wildly different from the steady-state performance of the disk if the data is being read from a cache.

It may also be possible at a low level to pass a flag to encourage the OS to disable or flush the cache, but I don’t know if there’s any way to instruct the hard drive to do the same.

ljedrz commented 6 years ago

I only pasted the benchmark for 4MB as a reference - I tested files from 4KB up to 16MB. I'm not an expert with low-level hard drive stuff, just hoping my attempt can help someone else.

retep998 commented 6 years ago

Ideally you would test with a file approximately twice as big as you have RAM, so if you have 8GB of RAM you would want to test with a 16GB file.

rbtcollins commented 5 years ago

@spease extended open flags are accessible via https://doc.rust-lang.org/std/os/windows/fs/trait.OpenOptionsExt.html. FWIW, lots of benchmarks exist showing that synchronous IO calls improve when the IO operation size is matched to the underlying block size (e.g. 4MB for modern SSDs) across multiple OSes, so it's very likely that there are improvements to be gained here. However, 4MB is a stack-breaking size by default, and heap allocations will destroy potential improvements (and further, if we're optimising, we probably want overlapped IO and a ring of buffers rather than just one buffer, which will always leave one device idle)... so it's complex ;)
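To make the OpenOptionsExt route concrete, here is a hedged sketch of opening a file with FILE_FLAG_SEQUENTIAL_SCAN via custom_flags. The constant value is taken from the Windows SDK headers (an assumption), and the non-Windows branch is just a plain open so the function is portable:

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

#[cfg(windows)]
fn open_sequential(path: &Path) -> io::Result<File> {
    use std::os::windows::fs::OpenOptionsExt;
    // Assumption: value of FILE_FLAG_SEQUENTIAL_SCAN from the Windows SDK.
    const FILE_FLAG_SEQUENTIAL_SCAN: u32 = 0x0800_0000;
    OpenOptions::new()
        .read(true)
        .custom_flags(FILE_FLAG_SEQUENTIAL_SCAN)
        .open(path)
}

#[cfg(not(windows))]
fn open_sequential(path: &Path) -> io::Result<File> {
    // No equivalent open flag here; fall back to a plain open.
    OpenOptions::new().read(true).open(path)
}
```

The flag tells the Windows cache manager to optimize read-ahead for sequential access, which is the usage pattern io::copy exercises.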

main-- commented 4 years ago

heap allocations will destroy potential improvements

How come? io::copy would allocate once at the beginning. Heap memory isn't inherently slower and one memory allocation is basically free compared to the heavy IO that's happening afterwards.

retep998 commented 4 years ago

But we also don't want to be allocating 4MB every time we do io::copy as someone might be doing a bunch of small copies.

the8472 commented 4 years ago

possible approaches:

main-- commented 4 years ago

specialize copy operations from file to memory to read directly into the already allocated target, thus skipping the need for a buffer

I think some kind of specialization for BufRead sources would be ideal. Because in this case, io::copy should not buffer at all and simply use the reader's buffer. Effectively this also gives you a customizable buffer size by wrapping your File in a BufReader with whatever size you want.

As of today, if your source is BufRead then io::copy is always the wrong answer. I got a substantial performance improvement using a simple loop instead:

loop {
    let buf = reader.fill_buf()?;
    if buf.is_empty() { break; }
    let consumed = file_out.write(buf)?;
    drop(buf);
    reader.consume(consumed);
}
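For context, a self-contained version of that loop (the name copy_via_bufread is illustrative, not part of std), with short writes handled by consuming only what was actually written:

```rust
use std::io::{self, BufRead, Write};

// Copy by reusing the BufRead's internal buffer instead of allocating
// a separate one, as the comment above describes.
fn copy_via_bufread<R: BufRead, W: Write>(reader: &mut R, writer: &mut W) -> io::Result<u64> {
    let mut written = 0u64;
    loop {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            break; // EOF
        }
        // `write` may perform a short write; consume only what was written.
        let n = writer.write(buf)?;
        if n == 0 {
            // Guard against spinning forever on a writer that accepts nothing.
            return Err(io::Error::new(io::ErrorKind::WriteZero, "write returned 0"));
        }
        reader.consume(n);
        written += n as u64;
    }
    Ok(written)
}
```

Wrapping the source File in a BufReader::with_capacity effectively gives this loop a user-chosen buffer size.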
rbtcollins commented 4 years ago

@main-- a single heap allocation is entirely reasonable for a single invocation of copy of a lot of data. For many copies of a small amount of data, one ends up with a lot of allocations again (consider for instance copying data out of a tar, where we have views on the data and there's no actual synchronous IO etc. happening).

I like some of the suggested possibilities with specialisation.

the8472 commented 4 years ago

I think some kind of specialization for BufRead sources would be ideal.

It's a nice general optimization, but not necessarily ideal if you use a Vec as the Write. Adding a BufReader on the reader side does let you control the buffer size, but it still introduces an unnecessary memcpy compared to File::read_to_end.

Perhaps a writer-side optimization would be more flexible. Then you could specialize either for BufWriter or Vec.
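To illustrate the writer-side idea: when the destination is known to be a Vec, the intermediate buffer can be skipped entirely with read_to_end. This is only a sketch of what such a specialization could do, not std's actual implementation:

```rust
use std::io::{self, Read};

// What a Vec-destination specialization of io::copy could do:
// read_to_end appends straight into the Vec's allocation, skipping
// io::copy's intermediate buffer and its extra memcpy.
fn copy_to_vec<R: Read>(reader: &mut R, out: &mut Vec<u8>) -> io::Result<u64> {
    let n = reader.read_to_end(out)?;
    Ok(n as u64)
}
```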

main-- commented 4 years ago

That sounds rather uncommon to me, though. After all, read_to_end is provided by the Read trait, so it's always available. The only case I can see where you would invoke io::copy with a Vec as the destination is some generic interface that takes any W: Write. But IMO such interfaces are ill-designed, as they could simply implement Read instead and offer a much more flexible API.

the8472 commented 4 years ago

Yes, generic code taking a Write was the concern. E.g. tar-rs has a tar archive Builder which takes a Write and is fed with Read items. I'm not saying that creating an in-memory archive would be a good idea, just pointing out that these patterns do exist.

Niketin commented 4 years ago

I had the same kind of issue but with a different library (indicatif). The copy rate was around 70 MB/s (within a single NVMe SSD). The problem was actually that I was running the debug build of my program. When I built and ran the release version, the speed jumped to over 1 GB/s.

the8472 commented 3 years ago

With #78641 merged you can now (on the next nightly) use BufWriter to control the chunk size of io::copy. Since this relies on specialization, it only works if you pass a BufWriter directly; any other wrapper around it would disable the optimization.
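The pattern this describes looks like the sketch below (the function name and signature are illustrative). The chunk-size effect only materializes on compilers with the #78641 specialization; on other toolchains the code still runs correctly, just without the speedup:

```rust
use std::io::{self, BufWriter, Read, Write};

// Pass a BufWriter with a chosen capacity directly to io::copy; with the
// #78641 specialization, io::copy fills the BufWriter's buffer directly,
// making `chunk` the effective copy chunk size.
fn copy_with_chunk_size<R: Read, W: Write>(reader: &mut R, writer: W, chunk: usize) -> io::Result<u64> {
    let mut buffered = BufWriter::with_capacity(chunk, writer);
    let n = io::copy(reader, &mut buffered)?;
    // Flush what remains in the buffer before dropping it.
    buffered.flush()?;
    Ok(n)
}
```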