potocpav / npy-rs

NumPy file format (de-)serialization in Rust
30 stars 7 forks source link

Supporting 1 rust type <-> many npy types #12

Open ExpHP opened 5 years ago

ExpHP commented 5 years ago

There are many datatypes which can appear in NPY files which currently cannot be supported by the design of the Serialize trait. To name a few:

In order to support any of these, we need to get rid of the "one type string <--> one type" way of thinking. Reading and writing needs to depend on the type string:

// was:
pub trait Serializable : Sized {
    fn dtype() -> DType;
    fn n_bytes() -> usize;
    fn read(c: &[u8]) -> Self;
    fn write<W: Write>(&self, writer: &mut W) -> Result<()>;
}

// must now instead be something closer to:
pub trait Serializable : Sized {
    fn recommended_dtype() -> DType;
    fn read(&Dtype, &[u8]) -> Self;
    fn write<W: Write>(&self, &Dtype, &mut W) -> Result<()>;
}
impl DType {
    pub fn n_bytes(&self) -> usize;
}

Of course, having every read or write dispatch on the DType could incur a performance loss, so the above is not precisely the design we want.

I've been prototyping a solution myself, which I'll write about in a separate comment.

ExpHP commented 5 years ago

I am currently prototyping replacement traits for Serializable in my own fork.

At the moment, they are completely unused by the crate (I've built them from the bottom up in order to see what was feasible, and I plan to integrate them into the crate later). The bulk of the changes are in these files:

In just a couple of days they've already had to undergo numerous broad changes. At the time of writing, they look like this:

// implemented by i32, u64, Vec<u8>... but not [u8]
pub trait Deserialize: Sized {
    type Reader: TypeRead<Value=Self>;

    fn reader(type_str: &TypeStr) -> Result<Self::Reader, TypeStrUnsupportedError>;
}

// implemented by i32, u64, [u8]...
pub trait Serialize {
    type Writer: TypeWrite<Value=Self>;

    fn writer(type_str: &TypeStr) -> Result<Self::Writer, TypeStrUnsupportedError>;
}

// Not implemented for Vec<u8>, which can become either `|Sn` or `|Vn`.
pub trait AutoSerialize: Serialize {
    fn default_type_str() -> TypeStr;
}

// basically Fn(&[u8]) -> T with some overloadable methods for efficiency
pub trait TypeRead {
    ...
}

// sort of like:  for<W: io::Write> Fn(W, &T) -> io::Result<()>,
// if that type was expressible in rust
pub trait TypeWrite {
    ...
}

Some things that the new design can do:

(it also used to support e.g. reading |u4 as u64 or even i64 (types that impl From<u32>) but I couldn't find any way to make it easy to maintain, so I ripped it out)

It's reaching the point where I need to start figuring out how to fit it into the existing design, supporting the current usage of arrays and #[derive].

If the performance impact ends up being significant, then under this design it should be possible to introduce wrapper types like Le<i32> that only support a single endianness, and then add field annotations like #[npy(le)] to let them be used invisibly by the derives:

#[derive(npy::Serialize, npy::Deserialize)]
struct MyStruct {
    #[npy(le)] foo: i32,
    #[npy(le)] bar: i64,
}

...though I'm hoping it doesn't come to that.

Any thoughts, suggestions, or critique is welcome.