jturner314 / ndarray-npy

.npy and .npz file format support for ndarray
https://docs.rs/ndarray-npy
Apache License 2.0

Buffering File I/O #50

Closed bluss closed 3 years ago

bluss commented 3 years ago

I just noticed when reading through the code that write_npy uses unbuffered file I/O. For contiguous arrays this is ideal, but for other layouts it can be slower than it should be. Tested with the latest version from crates.io and cargo run --release.

This test program writes a 4 GB file; you can of course scale that down, and it should be easy to reproduce even at 10 MB.

```rust
use ndarray::prelude::*;
use ndarray_npy::{self, WriteNpyExt};
use std::io::BufWriter;
use std::fs::File;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let mut array: Array<f32, _> = Array::zeros((1024, 1024, 1024));
    array.swap_axes(0, 1);
    // slow
    //ndarray_npy::write_npy("zeros.npy", &array)?;

    // "fast"
    let file = BufWriter::new(File::create("zeros.npy")?);
    array.write_npy(file)?;  // In real code, flush the BufWriter as the docs recommend

    Ok(())
}
```

I think I can recommend just using BufWriter. When BufWriter is passed a write larger than its internal buffer, it bypasses the buffer and writes directly to the underlying file.

jturner314 commented 3 years ago

Thanks for looking over the crate and mentioning this issue. I've known about this but hadn't had a chance to test what impact it would have (to know under what circumstances I should use buffering). I just tried a few cases in release mode based on your example code (plus a .flush() after the array.write_npy call):

The difference between the non-buffering and buffering cases with non-standard layout is huge! I had no idea it was that significant. The overhead of nested buffering is noticeable, but not huge, in the non-standard layout case. In the standard layout case, there is effectively no difference regardless of buffering.

> I think I can recommend just using BufWriter. When BufWriter is passed a write larger than its internal buffer, it bypasses the buffer and writes directly to the underlying file.

This is good to know, because it means that the overhead of the BufWriter should be small even in the cases where the array data is written in a single large write (when the array is in standard or Fortran layout). This conclusion matches the results I obtained above.

There are a couple of other things I'd like to test as well; see the end of this comment.

I see a couple of options:

  1. Change the write_npy function to always use buffering (since it's always working with a File), add a .flush() to WriteNpyExt and NpzWriter, and use BufWriter in the examples for WriteNpyExt and NpzWriter.

  2. Change WriteNpyExt to always use buffering.

When I get a chance, I'll do some testing of the buffering overhead when writing to in-memory writers (to know whether it's potentially useful for the user to be able to write without buffering) and testing the approaches for handling .npz files.

bluss commented 3 years ago

I think option (1) would follow the Rust language conventions most closely: no wrapping that the user doesn't want to pay for. But I would understand if you think it leaves the user high and dry without buffering. If I understand correctly, the ZipWriter doesn't buffer by itself? My guess is that NpzWriter(ZipWriter(BufWriter(File))) is the best layering.

Maybe the best reason to avoid (2) is to not allocate a BufWriter unnecessarily? But it's hard to be perfect about that anyway.

jturner314 commented 3 years ago

> My guess is that NpzWriter(ZipWriter(BufWriter(File))) is the best layering.

That was my intuition, too, but I just did some testing, and it turns out that's not the case. The results are actually pretty interesting.

I modified write_example in examples/simple.npz to write a third array of shape 256×1024×1024 with f32 element type and all elements zero, and I removed the call to read_example. The results are as follows:

Observations:

In conclusion, it's reasonable for NpzWriter to always buffer around individual arrays, because it helps substantially in the non-standard layout cases, and the "compressed, file, standard layout" case is more important than the "compressed, in-memory writer, standard layout" case. The allocations for the BufWriters could possibly hurt performance for small arrays, but writing small arrays doesn't take much time anyway. The user may want to also wrap the underlying writer in a BufWriter if they know that their arrays have non-standard layouts, but I wouldn't recommend it in general.

> I think option (1) would follow the Rust language conventions the closest, no wrapping that the user doesn't want to pay for, but I would understand if you think it leaves the user high and dry without buffering.

Yeah, I agree that option (1) most closely follows existing conventions. The only reason to prefer (2) would be to make things a little easier for the user in the common case (so that they wouldn't have to wrap the File in a BufWriter themselves). Option (1) is fine, though, IMO, because once I add a .flush() to WriteNpyExt::write_npy, all the user has to do is add a call to BufWriter::new, and flushing will be handled automatically.

Thanks for talking it through with me.

jturner314 commented 3 years ago

I've created #51 to resolve this issue.