potocpav / npy-rs

NumPy file format (de-)serialization in Rust
30 stars 7 forks source link

Arrays with ndim != 1 #10

Open ExpHP opened 5 years ago

ExpHP commented 5 years ago

Is there any particular reason why is the current implementation restricted to 1-dimensional arrays?

In particular, right now I want to use this crate to read a bsr_matrix saved by scipy.sparse.save_npz. The output contains a data.npy which has a header like this:

{'descr': '<f8', 'fortran_order': False, 'shape': (1358814, 3, 3), }

I already have a type of my own that can represent this; all I really need from the npy crate are its shape and a raw Vec<f64> of all the data.

More generally, I think that the reasonable behavior for this crate when ndim != 1 is to continue doing exactly what it already does; give a flat, 1-dimensional iterator of the data's scalars in reading order, and have to_vec() produce a flat 1D Vec<_>. If anybody feels the need to make the dimensionality of the data more "substantial," they're free to inspect the shape attribute and go nuts.

ExpHP commented 5 years ago

A note: The public API I've been prototyping for n-dimensional arrays is:

Reading

The existing API is good enough, we just need a few more getters:

pub enum Order { C, Fortran }

impl<'a, T: Deserialize> NpyData<'a, T> {
    /// Get the shape as written in the file.
    pub fn shape(&self) -> &[usize] {
        &self.shape
    }

    /// Get strides for each of the dimensions.
    ///
    /// This is the amount by which the item index changes as you move along each dimension.
    /// It is a function of both [`NpyData::order`] and [`NpyData::shape`],
    /// provided for your convenience.
    pub fn strides(&self) -> &[usize] {
        &self.strides
    }

    /// Get whether the data is in C order or fortran order.
    pub fn order(&self) -> Order {
        self.order
    }
}

Documentation of get, to_vec and etc. are updated to clarify that they access the items in a flattened layout, and it is suggested that the user look at .shape() and .strides() to figure out how their n-dimensional data is arranged.

I did not bother adding anything like e.g. get_nd(nd_index: &[usize]) -> Option<&T> because it opens a rather large can of worms and I doubt it could be made performant (so I don't want to encourage working with the data in this manner).


Writing

With #16, #17, and this, I found myself wrestling with 2x2x2=8 constructors for NpyWriter. That's far too much! So I introduced a Builder API:

/// Configuration for an output `.NPY` file.
pub struct Builder<Row>;

// returns Builder::new()
impl<Row> Default for Builder<Row> { ... }

impl<Row> Builder<Row> {
    /// Construct a builder with default configuration.
    ///
    /// Data order will be initially set to C order.
    ///
    /// No dtype will be configured; the [`Builder::dtype`] method **must** be called.
    pub fn new() -> Self;

    /// Set the data order for arrays with more than one dimension.
    pub fn order(mut self, order: Order) -> Self;

    /// Use the specified dtype.
    pub fn dtype(mut self, dtype: DType) -> Self;

    /// Calls [`Builder::dtype`] with the default dtype for the type to be serialized.
    pub fn default_dtype(self) -> Self
    where Row: AutoSerialize;
}

impl<Row: Serialize> Builder<Row> {
    /// Begin writing an array of known shape.
    ///
    /// Panics if [`Builder::dtype`] was not called.
    pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize]) -> io::Result<NpyWriter<Row, W>>;

    /// Begin writing a 1d array, of length to be inferred.
    ///
    /// Panics if [`Builder::dtype`] was not called.
    pub fn begin_1d<W: Write + Seek>(&self, w: W) -> io::Result<NpyWriter<Row, W>>;
}
ExpHP commented 5 years ago

Hmmm, eliminating the Seek bound from begin_with_shape ended up being tricker than I thought it would be. If you try the PanicSeek newtype strategy, it ends up having to become public:

impl<Row: Serialize> Builder<Row> {
    pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize])
        -> io::Result<NpyWriter<Row, PanicSeek<W>>>;   // <--------- ewww, gross

    pub fn begin_1d<W: Write + Seek>(&self, w: W)
        -> io::Result<NpyWriter<Row, W>>;
}

I had a bit of a clever idea to try hiding it with dynamic polymorphism, however, this introduces a lifetime:

pub(crate) enum MaybeSeek<'w, W> {
    Is(Box<dyn WriteSeek + 'w>),
    Isnt(W),
}

impl<Row: Serialize> Builder<Row> {
    pub fn begin_with_shape<'w, W: Write + 'w>(&self, w: W, shape: &[usize])
        -> io::Result<NpyWriter<'w, Row, W>>;

    pub fn begin_1d<'w, W: Write + Seek + 'w>(&self, w: W)
        -> io::Result<NpyWriter<'w, Row, W>>;
}

trait WriteSeek: Write + Seek {}
impl<W: Write + Seek> WriteSeek for W {}

I feel like it should be possible to remove this lifetime by turning the trait into WriteSeek<W> (which the compiler will assume borrows from W due to variance), however, the compiler doesn't seem to agree.

By the way, I benchmarked this, and the worst impact on performance I was able to register was a 5% slowdown, which occurs on plain dtypes (record dtypes are basically unaffected).

bluenote10 commented 4 years ago

My use case was to read multidimensional tensors, so this limitation is also quite a showstopper. Passing the shape information around separately is so ugly and error prone, considering that it is available from the file.

Could we not start simple: Only add a shape getter (the strides can be inferred from that, right?) and as a first step only fill it on read (to side step having many arguments on writer side for now)?

Edit: Looking over the documentation/code a bit more, I'm getting confused. The DType and Field types actually seem to provide that information. But as far as I can see, both of these types are public without being ever used by any public interface, i.e., the user simply can't access them. Is there just a function missing?

ExpHP commented 4 years ago

N.B. shape() alone does not imply strides because arrays can either be stored as C-order or Fortran order. strides() is just a helper method to help write code that takes this into account.

When I was working on this, @potocpav seemed to have been very busy with other things. At some point in the past, I began to get a bit impatient waiting for my PRs to move, and had determined that if I didn't hear back within another week that I would start fixing up examples and documentation and preparing to release my own fork under a new name... and then suddenly I heard back from the author on my PR. (but then he disappeared again!)

Edit: Looking over the documentation/code a bit more, I'm getting confused. The DType and Field types actually seem to provide that information. But as far as I can see, both of these types are public without being ever used by any public interface, i.e., the user simply can't access them. Is there just a function missing?

My memory is foggy but I definitely do recall seeing some confusing things regarding what is and what isn't public API.

You may want to check out my fork and see if that API makes more sense. (try git clone and cargo doc --open). According to the commit list, it has both nd support as well as allowing one numpy type to be deserialized as multiple rust types. I'm pretty sure that I also tried to clean up parts of the API that I found confusing, though honestly, it has been a long while and I don't remember too much any more about what I did: https://github.com/ExpHP/nippy


Edit 2021/07/01: I have now released https://crates.io/crates/npyz which has this feature and a lot more.