Closed spease closed 3 years ago
You mention Windows, so this may be platform-specific behavior. On Unix systems I would fadvise POSIX_FADV_SEQUENTIAL on the input file before copying. Googling tells me that the Windows equivalent is FILE_FLAG_SEQUENTIAL_SCAN. Have you tried that?
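For concreteness, here is a minimal sketch of what passing that flag looks like from Rust (open_sequential is a hypothetical helper name; the FILE_FLAG_SEQUENTIAL_SCAN constant value comes from the Windows API headers and is not re-exported by std):

```rust
use std::fs::{File, OpenOptions};
use std::io;

// Sketch: open a file with a sequential-access hint. On Windows this passes
// FILE_FLAG_SEQUENTIAL_SCAN via OpenOptionsExt::custom_flags; on other
// platforms it falls back to a plain open. A POSIX build could additionally
// call posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL) through the libc crate.
fn open_sequential(path: &str) -> io::Result<File> {
    let mut opts = OpenOptions::new();
    opts.read(true);
    #[cfg(windows)]
    {
        use std::os::windows::fs::OpenOptionsExt;
        const FILE_FLAG_SEQUENTIAL_SCAN: u32 = 0x0800_0000; // from winbase.h
        opts.custom_flags(FILE_FLAG_SEQUENTIAL_SCAN);
    }
    opts.open(path)
}
```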
After a coarse overview, my intuition is that there is some funny business going on with the length determined by seek and the subsequent inc calls on the ProgressBar, which could explain the unrealistic results.
It’s possible. The actual times did increase dramatically from the std::io::copy default though - my suspicion is that there is probably some caching to RAM going on or something as cold runs were generally slower than hot runs.
This was for some work code that I’m no longer actively developing. I’ll leave this email marked as unread for a while in case I have some spare time to write a benchmark.
I don’t think the other flag that was mentioned is accessible with the default open-file API, but in any case it at least appears there’s some low-hanging fruit with respect to std::io::copy performance.
I tried benchmarking io::copy on Windows before and after changing DEFAULT_BUF_SIZE to 64KB (instead of the original 8KB), and the results were slightly worse (up to ~3%) with the modified setup, both on HDD and SSD. However, I'm not sure if my benchmarks were good enough - I was having a hard time tweaking them so that I could see any difference between results for different file sizes.
FWIW, I'm providing the benchmark code below; it can easily be expanded to different file sizes:
use std::fs::File;
use std::io::{self, Seek, SeekFrom};

#[bench]
fn bench_copy_4MB(b: &mut Bencher) {
    let mut file = test::black_box(File::open("4MB").unwrap());
    let mut writer: Vec<u8> = Vec::with_capacity(4 * 1024 * 1024);
    b.iter(|| {
        let mut x = Ok(0);
        for _ in 0..100 {
            // Rewind before each copy; without this the file is at EOF after
            // the first pass and every later io::copy transfers zero bytes.
            file.seek(SeekFrom::Start(0)).unwrap();
            x = io::copy(&mut file, &mut writer);
            writer.clear();
        }
        x
    })
}
4MB isn’t a whole lot. That’s well within the size of hard drive caches or possibly an OS memory-based cache. I was noticing the difference with files several gigabytes in size.
As a sanity check, maybe look to see what the actual time is, not just the % difference between timings with different buffer sizes. The actual time may be wildly different from the steady-state performance of the disk if it’s reading the data from a cache.
It may also be possible at a low level to pass a flag to encourage the OS to disable or flush the cache, but I don’t know if there’s any way to instruct the hard drive to do the same.
I only pasted the benchmark for 4MB as a reference - I tested files from 4KB up to 16MB. I'm not an expert with low-level hard drive stuff, just hoping my attempt can help someone else.
Ideally you would test with a file approximately twice as big as you have RAM, so if you have 8GB of RAM you would want to test with a 16GB file.
@spease extended open flags are accessible via https://doc.rust-lang.org/std/os/windows/fs/trait.OpenOptionsExt.html. FWIW, lots of benchmarks exist showing that synchronous IO calls improve when matching IO operations to the underlying block size (e.g. 4MB for modern SSDs) across multiple OSes, so it's very likely that there are improvements to gain here. However, 4MB is a stack-breaking size by default, and heap allocations will destroy potential improvements (and further, if we're optimising, we probably want overlapped IO and a ring of buffers rather than just one buffer, which will always leave one device idle)... so it's complex ;)
heap allocations will destroy potential improvements

How come? io::copy would allocate once at the beginning. Heap memory isn't inherently slower, and one memory allocation is basically free compared to the heavy IO that's happening afterwards.
But we also don't want to be allocating 4MB every time we do io::copy, as someone might be doing a bunch of small copies.
possible approaches:
- specialize copy operations from file to memory to read directly into the already allocated target, thus skipping the need for a buffer
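The "read directly into the already allocated target" idea can be sketched like this (file_to_vec is a hypothetical helper name, not part of std):

```rust
use std::fs::File;
use std::io::{self, Read};

// Sketch: read the whole file straight into the destination Vec, skipping
// io::copy's intermediate buffer. Pre-sizing the Vec from the file length
// avoids repeated reallocations during the read.
fn file_to_vec(path: &str) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let mut buf = Vec::with_capacity(file.metadata()?.len() as usize);
    file.read_to_end(&mut buf)?; // appends directly into buf's allocation
    Ok(buf)
}
```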
I think some kind of specialization for BufRead sources would be ideal, because in this case io::copy should not buffer at all and simply use the reader's buffer. Effectively this also gives you a customizable buffer size by wrapping your File in a BufReader with whatever size you want.
As of today, if your source is BufRead then io::copy is always the wrong answer. I got a substantial performance improvement using a simple loop instead:
loop {
    let buf = reader.fill_buf()?;
    if buf.is_empty() { break; } // EOF
    // write may be partial, so only consume what was actually written
    let consumed = file_out.write(buf)?;
    reader.consume(consumed);
}
@main-- a single heap allocation is entirely reasonable for a single invocation of copy on a lot of data. For many copies of a small amount of data, one ends up with a lot of allocations again (consider for instance copying data out of a tar, where we have views on the data and there's no actual synchronous IO etc. happening).
I like some of the suggested possibilities with specialisation.
I think some kind of specialization for BufRead sources would be ideal.
It's a nice general optimization, but not necessarily ideal if you use a Vec as the Write. Adding a BufRead on the reader side does let you control the buffer size, but it still introduces an unnecessary memcopy compared to File::read_to_end.
Perhaps a writer-side optimization would be more flexible. Then you can specialize either for BufWriter or Vec.
That sounds rather uncommon to me though. After all, read_to_end is provided by the Read trait, so it's always available. The only case I can see where you would invoke io::copy with a Vec as destination is some generic interface that takes any W: Write. But IMO such interfaces are ill-designed, as they could simply implement Read instead and offer a much more flexible API.
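To make the design point concrete, here is a sketch of the two interface shapes being contrasted (all names here are hypothetical, not from any real crate):

```rust
use std::io::{self, Read, Write};

// A Write-taking function forces the caller's data through its own plumbing:
fn encode_into<W: Write>(data: &[u8], mut out: W) -> io::Result<()> {
    out.write_all(data)
}

// A source that implements Read instead lets the caller pick the copy
// strategy on their side: io::copy, read_to_end, a custom buffer, etc.
struct Encoder<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> Read for Encoder<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = (self.data.len() - self.pos).min(buf.len());
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```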
Yes, generic code taking a Write was the concern. E.g. tar-rs has a tar archive Builder which takes a Write and is fed with Read items. I'm not saying that creating an in-memory archive would be a good idea, just pointing out that these patterns do exist.
I had the same kind of issue but with a different library (indicatif). The copy rate was around 70 MB/s (within a single NVMe SSD). The problem was actually with running the debug version of my program. When I built and ran the release version, the speed jumped to over 1 GB/s.
With #78641 merged you can now (on next nightly) use BufWriter to control the chunk size of io::copy. Since this relies on specialization, it only works if you pass a BufWriter directly; any other wrapper around it would disable this optimization.
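A minimal sketch of that usage (copy_in_big_chunks is a hypothetical helper; the larger-chunk behavior assumes the #78641 specialization is present, but the code works either way):

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};

// Sketch: pass a BufWriter with an explicit capacity directly to io::copy to
// control the copy chunk size. On toolchains without the specialization this
// still copies correctly, just without the larger chunks.
fn copy_in_big_chunks(mut src: File, dst: File) -> io::Result<u64> {
    let mut writer = BufWriter::with_capacity(1024 * 1024, dst); // 1 MiB
    let n = io::copy(&mut src, &mut writer)?;
    writer.flush()?; // push any buffered tail out to the file
    Ok(n)
}
```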
I was using std::io::copy from an SSD to memory through a ProgressBarReader, which reported an eye-raising 30 MB/s. Wrapping it in a BufReader (either as the outer or middle layer) made no real difference.
So I googled around a bit and found the recommended size was 64KB according to some random Microsoft article, so I decided to copy-paste the std::io::copy implementation and increase the buffer size. This got up to 600 MB/s at about 256KB which sounded a lot more reasonable...then it continued going upwards from there past the point of believability (unless the files were being pulled from RAM as a consequence of my repeated accesses). With a 10MB Vec, I got 228.36 GB/s 🙃.
Unrealistic values aside, has anybody checked the buffer value to make sure there isn't some huge low-hanging fruit for std::io::copy speeds?
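For reference, the "copy-paste the implementation and increase the buffer size" experiment described above amounts to something like this (copy_with_buffer is a hypothetical name; the real std::io::copy fallback loop has the same shape with a fixed 8 KiB buffer):

```rust
use std::io::{Read, Write};

// Hedged sketch: io::copy's fallback loop with a caller-chosen buffer size,
// e.g. 256 * 1024 instead of the 8 KiB default.
fn copy_with_buffer<R: Read + ?Sized, W: Write + ?Sized>(
    reader: &mut R,
    writer: &mut W,
    buf_size: usize,
) -> std::io::Result<u64> {
    let mut buf = vec![0u8; buf_size];
    let mut written = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            return Ok(written); // EOF
        }
        writer.write_all(&buf[..n])?;
        written += n as u64;
    }
}
```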