Open ExpHP opened 5 years ago
A note: The public API I've been prototyping for n-dimensional arrays is:
Reading
The existing API is good enough, we just need a few more getters:
pub enum Order { C, Fortran }
impl<'a, T: Deserialize> NpyData<'a, T> {
/// Get the shape as written in the file.
pub fn shape(&self) -> &[usize] {
&self.shape
}
/// Get strides for each of the dimensions.
///
/// This is the amount by which the item index changes as you move along each dimension.
/// It is a function of both [`NpyData::order`] and [`NpyData::shape`],
/// provided for your convenience.
pub fn strides(&self) -> &[usize] {
&self.strides
}
/// Get whether the data is in C order or fortran order.
pub fn order(&self) -> Order {
self.order
}
}
Documentation of get
, to_vec
and etc. are updated to clarify that they access the items in a flattened layout, and it is suggested that the user look at .shape()
and .strides()
to figure out how their n-dimensional data is arranged.
I did not bother adding anything like e.g. get_nd(nd_index: &[usize]) -> Option<&T>
because it opens a rather large can of worms and I doubt it could be made performant (so I don't want to encourage working with the data in this manner).
Writing
With #16, #17, and this, I found myself wrestling with 2x2x2=8 constructors for NpyWriter. That's far too much! So I introduced a Builder API:
/// Configuration for an output `.NPY` file.
pub struct Builder<Row>;
// returns Builder::new()
impl<Row> Default for Builder<Row> { ... }
impl<Row> Builder<Row> {
/// Construct a builder with default configuration.
///
/// Data order will be initially set to C order.
///
/// No dtype will be configured; the [`Builder::dtype`] method **must** be called.
pub fn new() -> Self;
/// Set the data order for arrays with more than one dimension.
pub fn order(mut self, order: Order) -> Self;
/// Use the specified dtype.
pub fn dtype(mut self, dtype: DType) -> Self;
/// Calls [`Builder::dtype`] with the default dtype for the type to be serialized.
pub fn default_dtype(self) -> Self
where Row: AutoSerialize;
}
impl<Row: Serialize> Builder<Row> {
/// Begin writing an array of known shape.
///
/// Panics if [`Builder::dtype`] was not called.
pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize]) -> io::Result<NpyWriter<Row, W>>;
/// Begin writing a 1d array, of length to be inferred.
///
/// Panics if [`Builder::dtype`] was not called.
pub fn begin_1d<W: Write + Seek>(&self, w: W) -> io::Result<NpyWriter<Row, W>>;
}
shape
is provided at the begin
functions rather than via a setter, so that the W: Seek
bound can be removed. (No need to go back and write the shape if we know it from the very beginning!)
W: Seek
, but this technical limitation is easily worked around by introducing some private newtype wrapper that impls Seek
by panicking.open_with_dtype
introduced in #15 and NpyWriter::begin
introduced in #16 become redundant with the Builder API, which is why I would like to get this in before the next release so I can remove them! :PHmmm, eliminating the Seek
bound from begin_with_shape
ended up being tricker than I thought it would be. If you try the PanicSeek
newtype strategy, it ends up having to become public:
impl<Row: Serialize> Builder<Row> {
pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize])
-> io::Result<NpyWriter<Row, PanicSeek<W>>>; // <--------- ewww, gross
pub fn begin_1d<W: Write + Seek>(&self, w: W)
-> io::Result<NpyWriter<Row, W>>;
}
I had a bit of a clever idea to try hiding it with dynamic polymorphism, however, this introduces a lifetime:
pub(crate) enum MaybeSeek<'w, W> {
Is(Box<dyn WriteSeek + 'w>),
Isnt(W),
}
impl<Row: Serialize> Builder<Row> {
pub fn begin_with_shape<'w, W: Write + 'w>(&self, w: W, shape: &[usize])
-> io::Result<NpyWriter<'w, Row, W>>;
pub fn begin_1d<'w, W: Write + Seek + 'w>(&self, w: W)
-> io::Result<NpyWriter<'w, Row, W>>;
}
trait WriteSeek: Write + Seek {}
impl<W: Write + Seek> WriteSeek for W {}
I feel like it should be possible to remove this lifetime by turning the trait into WriteSeek<W>
(which the compiler will assume borrows from W
due to variance), however, the compiler doesn't seem to agree.
By the way, I benchmarked this, and the worst impact on performance I was able to register was a 5% slowdown, which occurs on plain dtypes (record dtypes are basically unaffected).
My use case was to read multidimensional tensors, so this limitation is also quite a showstopper. Passing the shape information around separately is so ugly and error prone, considering that it is available from the file.
Could we not start simple: Only add a shape
getter (the strides can be inferred from that, right?) and as a first step only fill it on read (to side step having many arguments on writer side for now)?
Edit: Looking over the documentation/code a bit more, I'm getting confused. The DType
and Field
types actually seem to provide that information. But as far as I can see, both of these types are public without being ever used by any public interface, i.e., the user simply can't access them. Is there just a function missing?
N.B. shape()
alone does not imply strides because arrays can either be stored as C-order or Fortran order. strides()
is just a helper method to help write code that takes this into account.
When I was working on this, @potocpav seemed to have been very busy with other things. At some point in the past, I began to get a bit impatient waiting for my PRs to move, and had determined that if I didn't hear back within another week that I would start fixing up examples and documentation and preparing to release my own fork under a new name... and then suddenly I heard back from the author on my PR. (but then he disappeared again!)
Edit: Looking over the documentation/code a bit more, I'm getting confused. The DType and Field types actually seem to provide that information. But as far as I can see, both of these types are public without being ever used by any public interface, i.e., the user simply can't access them. Is there just a function missing?
My memory is foggy but I definitely do recall seeing some confusing things regarding what is and what isn't public API.
You may want to check out my fork and see if that API makes more sense. (try git clone
and cargo doc --open
). According to the commit list, it has both nd support as well as allowing one numpy type to be deserialized as multiple rust types. I'm pretty sure that I also tried to clean up parts of the API that I found confusing, though honestly, it has been a long while and I don't remember too much any more about what I did: https://github.com/ExpHP/nippy
Edit 2021/07/01: I have now released https://crates.io/crates/npyz which has this feature and a lot more.
Is there any particular reason why is the current implementation restricted to 1-dimensional arrays?
In particular, right now I want to use this crate to read a
bsr_matrix
saved byscipy.sparse.save_npz
. The output contains adata.npy
which has a header like this:I already have a type of my own that can represent this; all I really need from the
npy
crate are its shape and a rawVec<f64>
of all the data.More generally, I think that the reasonable behavior for this crate when
ndim != 1
is to continue doing exactly what it already does; give a flat, 1-dimensional iterator of the data's scalars in reading order, and haveto_vec()
produce a flat 1DVec<_>
. If anybody feels the need to make the dimensionality of the data more "substantial," they're free to inspect theshape
attribute and go nuts.