Support one rust type serializing as many dtypes

ExpHP commented 5 years ago

This is now ready for review!

I have chosen to finally submit a PR now that cargo test --all succeeds, and the behavior of all of the existing examples has been verified. Some work still remains to be done.

Overview

Adds support for non-little endianness.
Adds de/serialization of the following dtypes:
- Byte strings |Sn and binary blobs |Vn (as Vec<u8>)
- Datetimes <M8[us] and timedeltas <m8[us] (as u64 and i64)
Serializable had to be split into three traits to accomodate these features. More about this later.
Adds full parsing and validation of string descrs, hopefully making it easier to add support for complex numbers, bools, np.float16, np.float128, and unicode strings in the future.

This is a large PR. However, as of Thursday June 6, I rewrote the commit history so that each commit can be easily reviewed. (there is no longer anything introduced in one commit that gets changed in another)

Closes #11. Closes #12. Closes #19.

I will add a comment which itemizes all of the changes to the public API and the reasoning behind them.

ExpHP commented 5 years ago

Summary of all changes

Additions / Big changes

The new serialization traits

Serializable has been split up quite a bit, because not all operations can be done on all types.

/// Trait that permits reading a type from an `.npy` file.
pub trait Deserialize: Sized {
    /// Think of this as a `Fn(&[u8]) -> Self`, with bonuses.
    ///
    /// Unfortunately, until rust supports existential associated types, actual closures
    /// cannot be used here, and you must define something that manually implements [`TypeRead`].
    type Reader: TypeRead<Value=Self>;

    /// Get a function that deserializes a single data field at a time
    ///
    /// The function receives a byte buffer containing at least
    /// `dtype.num_bytes()` bytes.
    ///
    /// # Errors
    ///
    /// Returns `Err` if the `DType` is not compatible with `Self`.
    fn reader(dtype: &DType) -> Result<Self::Reader, DTypeError>;
}

/// Trait that permits writing a type to an `.npy` file.
pub trait Serialize {
    /// Think of this as some sort of `for<W: io::Write> Fn(W, &Self) -> io::Result<()>`.
    ///
    /// Unfortunately, rust does not have generic closures, so you must manually define
    /// your own implementor of the [`TypeWrite`] trait.
    type Writer: TypeWrite<Value=Self>;

    /// Get a function that serializes a single data field at a time.
    ///
    /// # Errors
    ///
    /// Returns `Err` if the `DType` is not compatible with `Self`.
    fn writer(dtype: &DType) -> Result<Self::Writer, DTypeError>;
}

/// Subtrait of [`Serialize`] for types which have a reasonable default `DType`.
///
/// This opens up some simpler APIs for serialization. (e.g. [`::to_file`])
pub trait AutoSerialize: Serialize {
    /// A suggested format for serialization.
    ///
    /// The builtin implementations for primitive types generally prefer `|` endianness if possible,
    /// else the machine endian format.
    fn default_dtype() -> DType;
}

Deserialize is implemented for primitive ints, floats, and Vec<u8>. (but not [u8], which is unsized)
Serialize is implemented for primitive ints, floats, Vec<u8>, [u8], and for things behind a variety of pointer types.
AutoSerialize is implemented for primitive ints and floats. (but not Vec<u8>/[u8], for which |Sn and |Vn both sound reasonable and both have disadvantages)

Where's `n_bytes`?

It's on DType now. There is no other way to possibly support e.g. |V42.

impl DType {
    pub fn num_bytes(&self) -> usize;
}

Worth noting is that, unfortunately, this means the compiler can no longer constant-fold the sizes for large records. To mitigate that, we now have...

Two-stage de/serialization

So, what's the deal with the Reader and Writer types? Basically, types now need to validate their DTypes and possibly do different things based on what it contains.

To ensure that this can be done efficiently, de/serialization now takes place in two stages:

DType validation. (and potentially caching useful info like offsets)
The actual reading/writing.

The first step is, of course, done by Serialize::writer and Deserialize::reader already seen above. The second step is done by these methods:

pub trait TypeRead {
    type Value;

    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}

pub trait TypeWrite {
    type Value: ?Sized;

    fn write_one<W: io::Write>(&self, writer: W, value: &Self::Value) -> io::Result<()>
    where Self: Sized;
}

Needless to say, manually implementing these traits has become a fair bit of a chore now. Please see the updated roundtrip example.

The error type exposes a single public constructor for use by manual impls of the traits.

/// Indicates that a particular rust type does not support serialization or deserialization
/// as a given [`DType`].
#[derive(Debug, Clone)]
pub struct DTypeError(ErrorKind);

impl fmt::Display for DTypeError { ... }
impl std::error::Error for DTypeError { ... }

impl DTypeError {
    pub fn custom<S: AsRef<str>>(msg: S) -> Self;
}

One more trait: `TypeWriteDyn`

There's... this... thing.

pub trait TypeWriteDyn: TypeWrite {
    #[doc(hidden)]
    fn write_one_dyn(&self, writer: &mut dyn io::Write, value: &Self::Value) -> io::Result<()>;
}

impl<T: TypeWrite> TypeWriteDyn for T { }

Long story short: dyn TypeWrite can't do anything, so you should use dyn TypeWriteDyn instead.

I'd remove this from the PR if I could to save it for later consideration... but currently some of the built-in impls use it. I need to do some benchmarking first.

`Outfile::open_with_dtype`

The existing methods for writing files require AutoSerialize instead of Serialize. One new method, OutFile::open_with_dtype, was added for types that only implement the latter.

impl<Row: AutoSerialize> OutFile<Row> {
    /// Create a file, using the default format for the given type.
    pub fn open<P: AsRef<Path>>(path: P) -> io::Result<Self>;
}

impl<Row: Serialize> OutFile<Row> {
    pub fn open_with_dtype<P: AsRef<Path>>(dtype: &DType, path: P) -> io::Result<Self>;
}

pub fn to_file<S, T, P>(filename: P, data: T) -> ::std::io::Result<()> where
        P: AsRef<Path>,
        S: AutoSerialize,
        T: IntoIterator<Item=S>;

`DType::Plain.ty` changed from `String` -> `TypeStr`

The new TypeStr type is a fully parsed form of a stringlike descr. This is necessary so that reader and writer can easily match on various properties of the descr without having to look at string text.

/// Represents an Array Interface type-string.
///
/// This is more or less the `DType` of a scalar type.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct TypeStr {
    /* no public fields */
}

impl fmt::Display for TypeStr { ... }
impl str::FromStr for TypeStr {
    type Error = ParseTypeStrError;
    ...
}

Parsing one generally produces validation errors for anything not accepted by the np.dtype function.

It currently exposes no public API for inspection or manipulation.

The error is just a garden-variety error type (Debug, Clone, Display, FromStr). I purposefully didn't use nom because nom gives terrible error messages.

There is a helper for "upgrading" a TypeStr to a DType:

impl DType {
    /// Construct a scalar `DType`. (one which is not a nested array or record type)
    pub fn new_scalar(ty: TypeStr) -> Self;
}

`"derive"` feature

This is a replacement for #[macro_use] extern crate npy_derive. Basically, we don't want #[derive(Serialize)] to clash with serde or other crates, so we instead recommend the following setup to users:

[dependencies]
npy-rs = { version = "0.5", features = ["derive"] }

extern crate npy;

#[derive(npy::Serialize, npy::Deserialize)]
struct MyStruct { a: i32, b: f64 }

Notice the above works even for 2015 edition crates. (the examples and doctests should attest to this!)

`NpyData::dtype`

impl<'a, T: Deserialize> NpyData<'a, T> {
    pub fn dtype(&self) -> DType;
}

This is necessary in order to be able to read an NPY file and then write a new one back that has the same format. I also wanted it for some of the tests.

Little Things

travis now checks that the examples/ build (because I had to update .travis.yml to add the feature flag, and at that point... why not?)
DType now derives Clone because I needed it in the derive macro.
npy_derive now depends on proc_macro2, whose types appear in the public API of syn and quote
unnecessary lifetime deleted from to_file

...and I think that's everything. (whew!)

Edit 1: Updated the signatures of TypeRead to reflect the new, faster API Edit 2:

Removed TypeRead::read_one_into (it can be added backwards compatibly later)
Removed "Helpers for creating |Sn and |Vn" (it can be added backwards compatibly later)
Added NpyData::dtype

ExpHP commented 5 years ago

Results of the existing bench.rs:

Before

running 2 tests
test read  ... bench:     112,028 ns/iter (+/- 6,674)
test write ... bench:     895,624 ns/iter (+/- 32,575)

After

running 2 tests
test read  ... bench:     463,360 ns/iter (+/- 153,967)   (~4x slowdown)
test write ... bench:   1,596,336 ns/iter (+/- 97,324)    (~2x slowdown)

Yeowch.

This is probably due to the double indirection in the current integer/float impls, which likely prevents inlining. There's no longer any good reason for them to have this indirection since I've decided not to support promotions like u32 -> u64, so we'll see how this improves with just a simple branch on endianness.

Edit: Further benchmarks

**Edit: After fixing indirection** ``` running 2 tests test read ... bench: 198,521 ns/iter (+/- 12,449) (x1.75 slowdown) test write ... bench: 1,092,576 ns/iter (+/- 61,045) (x1.25 slowdown) ``` **Edit: After adding dedicated Little Endian newtypes** ``` running 2 tests test le_read ... bench: 168,137 ns/iter (+/- 9,882) test le_write ... bench: 948,938 ns/iter (+/- 39,311) ``` --- The issue seems to be the fact that it no longer statically knows the strides. I have an idea for how to fix this: `read_one` will need to become something like ```rust pub trait TypeRead { type Value; fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]); } ```

ExpHP commented 5 years ago

The fn read_one(&self, bytes: &[u8]) -> (Self::Value, &[u8]); fix was wildly successful at optimizing the reads for scalar data and derived structs. Now my local branch is actually faster than master at reading most types! (that is, if you trust the current reading benchmarks... which I really don't)

For this reason I've decided against the inclusion of the fixed-endian wrapper types in this PR.

Before:

test array::read      ... bench:   1,325,069 ns/iter (+/- 43,806)
test array::write     ... bench:   4,602,994 ns/iter (+/- 614,125)
test one_field::read  ... bench:      95,717 ns/iter (+/- 4,101)
test one_field::write ... bench:     476,748 ns/iter (+/- 27,877)
test plain_f32::read  ... bench:      86,771 ns/iter (+/- 3,572)
test plain_f32::write ... bench:     561,184 ns/iter (+/- 9,845)
test simple::read     ... bench:     114,684 ns/iter (+/- 6,689)
test simple::write    ... bench:     869,041 ns/iter (+/- 50,057)

After:

test array::read         ... bench:   1,435,941 ns/iter (+/- 48,061)  (10% slowdown)
test array::write        ... bench:   4,567,738 ns/iter (+/- 160,708)
test one_field::read     ... bench:      56,118 ns/iter (+/- 1,440)   (50% speedup)
test one_field::write    ... bench:     477,967 ns/iter (+/- 28,492)
test plain_f32::read     ... bench:      55,995 ns/iter (+/- 3,302)   (50% speedup)
test plain_f32::write    ... bench:     477,693 ns/iter (+/- 12,268)  (15% speedup)
test simple::read        ... bench:      84,084 ns/iter (+/- 4,506)   (30% speedup)
test simple::write       ... bench:   1,091,575 ns/iter (+/- 18,112)  (25% slowdown)

ExpHP commented 5 years ago

I removed a couple of things that can be backwards-compatibly added back later. What remains are basically the things I need for unit tests.

ExpHP commented 5 years ago

This is finished now!

I completely rewrote the git history from scratch, organizing all of the changes into easily reviewable groups. You should be able to read each individual commit one by one and make sense of them.

ExpHP commented 5 years ago

Two notes:

I changed NpyData::dtype to return &DType instead of DType.
After this is merged, I would like to submit another PR to add support for n-dimensional arrays before the next version release.
- Basically, I'm currently prototyping support for shape.len() != 1 on my own fork, and the most reasonable API I was able to come up with makes open_with_dtype redundant (so I'd like to remove it before it becomes part of the published API).

ExpHP commented 5 years ago

@potocpav have you had any time to finish the review?

ExpHP commented 5 years ago

There's a couple of WTFs in this that I am aware about after cleaning them up on my fork. I can include those changes here if you want:

Using ByteOrder only for NativeEndian seemed kind of weird and required dumb hacks for u8, so I replaced it with a trait (not publically exposed) for reading primitives. This additionally made the macros not need to know method names or generate modules anymore.
impl_integer_serializable! was needlessly complicated, and the recursive loop structure could be replaced with a $()* repetition. The loop was a leftover from a previous design (one where widening conversions were supported, rather than requiring the size to be an exact match).

ExpHP commented 5 years ago

I'm also now having second thoughts about reading DateTimes/TimeDeltas as u64/i64 in this PR.

Basically, I think that at some point I'll want to add back the widening conversions for integers (because integer arrays produced in python code will often have a type that is dynamically chosen between <i4 and <i8), in which case DateTime as u64 doesn't feel right; there should be a dedicated DateTime wrapper type.

I really wish I could find a way to make this PR smaller...

(the only bit that I think can really be pulled out is the derive feature, but that accounts for relatively few changes)

ExpHP commented 5 years ago

Rats, I just realized DateTime should be an i64, not a u64. I'll just remove support for these for now.

xd009642 commented 4 years ago

Is there any progress on this, just cause I've come across this issue myself trying to deserialise u8 numpy arrays

ExpHP commented 3 years ago

So it is now two years later; Pavel never responded again to this PR or the other, and I found myself needing npy files in rust once again, so I finally came back to work on my fork and have now published it under the name npyz . In addition to the things from this PR, it also has support for:

io::Read/io::Write
n-dimensional arrays (:o !!!)
num-complex

It can be found here: https://github.com/ExpHP/npyz

Hopefully one day these additions can be merged upstream, so that they can finally be under the name npy where people are most likely to look for them; but I've more or less given up on that by this point.

potocpav / npy-rs