jturner314 / ndarray-npy

.npy and .npz file format support for ndarray
https://docs.rs/ndarray-npy
Apache License 2.0

adds `read_npz` and `write_npz`, convenience wrappers #46

Open jonathanstrong opened 3 years ago

jonathanstrong commented 3 years ago

These work like read_npy and write_npy, but read and write compressed .npz files instead. I wanted this functionality for writing ephemeral array files that I can check later if needed, without taking up too much disk space.

Compared to read_npy/write_npy, there is one major difference: since an .npz file can contain multiple named arrays, write_npz picks a default name for the single array it writes, while read_npz lets the user specify the name to extract. This may not be the best choice, but it seemed less than ideal not to permit specifying the name in read_npz, and I wanted write_npz to remain as simple as possible.

I picked the default name for write_npz based on what numpy does in savez_compressed ("arr_0.npy"). However, I think there is a divergence there. With np.load, you get a dict-like object that lets you access the arrays without the .npy extension (i.e. at key arr_0). With NpzReader, however, you need to use the full arr_0.npy name to retrieve the same array. Just wanted to flag this, as it tripped me up a bit.
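To make the naming difference concrete, here is a minimal sketch of the mapping between the two conventions, using only the standard library. Both `npz_member_name` and `numpy_key` are hypothetical helpers written for illustration; they are not part of ndarray-npy or NumPy, and this is not how either library implements its lookup.

```rust
/// Hypothetical helper (not part of ndarray-npy): turn a NumPy-style key
/// such as "arr_0" into the archive member name "arr_0.npy".
/// Names that already end in ".npy" are returned unchanged.
fn npz_member_name(key: &str) -> String {
    if key.ends_with(".npy") {
        key.to_owned()
    } else {
        format!("{}.npy", key)
    }
}

/// Hypothetical inverse: strip a trailing ".npy" to get the key that
/// NumPy's dict-like object exposes. Names without the suffix are
/// returned unchanged.
fn numpy_key(member: &str) -> &str {
    member.strip_suffix(".npy").unwrap_or(member)
}

fn main() {
    // Key as used with np.load -> name stored inside the .npz archive.
    println!("{}", npz_member_name("arr_0"));
    // Name stored inside the archive -> key as used with np.load.
    println!("{}", numpy_key("arr_0.npy"));
}
```

A caller could pass the result of `npz_member_name` wherever the full member name is expected, so that either spelling of the name retrieves the same array.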

Thanks for your consideration of this pull request.

jturner314 commented 3 years ago

> With np.load, you get a dict-like object that lets you access the arrays without the .npy extension (i.e. at key arr_0). With NpzReader, however, you need to use the full arr_0.npy name to retrieve the same array. Just wanted to flag this, as it tripped me up a bit.

Thanks for pointing this out. I've created #48 to track this issue.

Thanks also for the PR. There are a few things about the proposed API which are unsatisfying to me:

Creating a .npz archive for a single array seems somewhat awkward. I wonder if you'd be happier using a single-file compression format (such as .gz, .xz, .bz2, or .zst) applied to a .npy file instead of using a .zip/.npz archive. This would avoid the problem of choosing a name for the array in the archive and would avoid the complexity of the .zip format. For example, to write/read a .npy.gz file using ndarray-npy, you could do this:

use flate2::{bufread::GzDecoder, write::GzEncoder, Compression};
use ndarray::{array, Array2};
use ndarray_npy::{ReadNpyError, ReadNpyExt, WriteNpyError, WriteNpyExt};
use std::fs::File;
use std::io::{BufReader, BufWriter, Write};
use std::path::Path;

fn write_npy_gz<P, T>(path: P, array: &T) -> Result<(), WriteNpyError>
where
    P: AsRef<Path>,
    T: WriteNpyExt,
{
    // Note: I'm not sure if the `BufWriter` actually helps or not.
    let mut writer = GzEncoder::new(BufWriter::new(File::create(path)?), Compression::default());
    array.write_npy(&mut writer)?;
    writer.finish()?.flush()?;
    Ok(())
}

fn read_npy_gz<P, T>(path: P) -> Result<T, ReadNpyError>
where
    P: AsRef<Path>,
    T: ReadNpyExt,
{
    // Note: `bufread::GzDecoder` requires a `BufRead` source, hence the `BufReader`.
    T::read_npy(GzDecoder::new(BufReader::new(File::open(path)?)))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let arr1 = array![[1, 2, 3], [4, 5, 6]];

    // Write the array.
    write_npy_gz("foo.npy.gz", &arr1)?;

    // Read it back.
    let arr2: Array2<i32> = read_npy_gz("foo.npy.gz")?;

    println!("arr1:\n{}", arr1);
    println!("arr2:\n{}", arr2);
    assert_eq!(arr1, arr2);

    Ok(())
}

To read it with NumPy, you could do this:

import numpy as np
import gzip

def load_npy_gz(path):
    with gzip.open(path) as f:
        return np.load(f)

arr = load_npy_gz('foo.npy.gz')
print(arr)

(You could also decompress .npy.gz files at the command line using gunzip.)