potocpav / npy-rs

NumPy file format (de-)serialization in Rust
30 stars 7 forks source link

Read and write unstructured arrays #8

Closed milibopp closed 6 years ago

milibopp commented 6 years ago

Addresses #4.

This turned out more radical a change than I anticipated. Most notably, I made the following backwards-incompatible changes:

I did increase unit test coverage in those part I worked on. To do so some refactorings were made along the way to test at the module-level rather than end-to-end.

At the moment, this branch is not ready to be merged. First of all, I skipped a lot of error handling (see all instances of unimplemented!()). Furthermore, I would suggest merging RecordDType and DType into a single enum representing all dtypes.

Before taking this further, I would like to get some feedback on the general strategy though.

potocpav commented 6 years ago

Thanks a lot for the work you are putting into this. Don't worry about backwards-incompatible changes, the API was not ideal to begin with :) Unfortunately, I am busy right now. I will take a look at the changes tomorrow in the evening.

potocpav commented 6 years ago

Sorry for the late reply. I like the changes you made. After a clean-up (docs, unimplemented!()), I will merge.

As it is, eliminating the RecordDType/DType distinction is only logical; we should definitely get rid of the DType.shape field in the Simple case. Did you have something like this in mind?

pub enum DType {
    Simple(String),  // Numpy type string
    Structured(Vec<(String, DType, Vec<u64>)>),
}

Recurrent DType also prepares #7, which is nice.

milibopp commented 6 years ago

Thank you for looking over the changes. Glad you like the approach.

And yes, that kind of DType is what I had in mind. I hope to finish this up some time over the next few days.

milibopp commented 6 years ago

By now, I have added the error handling code and merged the two dtypes. All that is left, is to move the shape from the Plain variant to the Record variant. Perhaps I could promote the field metadata to a separate struct to make it easier to read.

For example:

enum DType {
    Plain(String),
    Record(Vec<Field>)
}

struct Field {
    name: String,
    dtype: DType,
    shape: Vec<u64>
}

The variants were called Simple and Structured before. I found the new names more fitting. Open to bikeshedding on that issue.

On another note, I intend to continue this work, until both #4 and #7 are resolved. But at some point it might make sense to open a separate PR for the remaining work. I would be okay with merging even at the current stage and leave the rest up to the resolution of #7.

potocpav commented 6 years ago

Naming - Plain and Record seem fine. I'm not big on bikeshedding :) The struct Field looks nicer than a tuple when accessing it's fields, so I'm for it.

Would you move shape into Record and split out the struct Field before merging? This completes #4 apart from docs. Follow-up PR(s) would then solve #7 and update docs & examples.

milibopp commented 6 years ago

Sure, will do. Can you point me to what part of the documentation needs updating for this PR?

potocpav commented 6 years ago

I would like a simple example of deserializing to a primitive type, it is probably a number-one use-case. Maybe stick it in a README too.

In lib.rs, there is

Only one-dimensional [structured arrays](https://docs.scipy.org/doc/numpy/user/basics.rec.html) are
supported at the moment as they map well to Rust structs.

which is no longer true.

milibopp commented 6 years ago

Okay, so I tried to move the shape field…

tl;dr: this should not be done, first of all, because [T; N] does not work that way. As an alternative one could keep the separation between primitive dtypes and structured ones, but this would make #7 harder. In fact, after going through this, the current design appears cleaner to me.

Long story

If we move the shape field to the record variant, there is no longer a sensible DType for [T; N]. That makes it impossible to implement Serializable for arrays. I think this is a very strong reason not to go that route.

There is one thing to be aware of, if we leave it like it is – in theory, one could create arrays of "plain" types that have a non-scalar shape. For reference, Numpy mitigates such a situation by eliding the shapes of the dtype and the array:

>>> numpy.zeros(dtype=numpy.dtype('(2,2)f8'), shape=(3,)).shape
(3, 2, 2)

I said "in theory", because there is no support for multi-dimensional arrays in this crate yet. However, I can imagine this to be actually quite useful. As an example, suppose you have a thin 2D array of shape (1000, 3) in an npy file, a sequence of 1000 vectors in 3D space. You want to represent the vectors as [f32; 3]. Therefore you try to read it as NpyData<[f32; 3]>.

Now, in Python this is a non-issue, as everything is dynamic anyway. Thus it makes sense for Numpy to just eliminate this degree of freedom. In Rust, on the other hand, you may want a bit more control over your datatypes.

So, even though this is not implemented yet, I would suggest leaving the DType representation more or less as is:

enum DType {
    Plain { ty: String, shape: Vec<u64> },
    Record(Vec<Field>)
}

struct Field {
    name: String,
    dtype: DType
}

This is the most flexible representation and I do not think it is necessarily bad to allow something like DType::Plain { ty: "<f8", shape: vec![3] } at the top-level.

I checked what the current implementation on this branch does, if we add an implementation of NpyRecord for [T; n]. It does not work for reading at all. And it does something very odd when writing. So I added a check to prevent that.

milibopp commented 6 years ago

Just added the documentation and examples, too. Pending the above mentioned design decision, this could be merged imho.

potocpav commented 6 years ago

LGTM. Thanks for the great work you put into this!

I agree with your reasoning about the shape field. The multi-dimensional arrays can be enabled later.