potocpav / npy-rs

NumPy file format (de-)serialization in Rust

Read only part of array fields #6

Closed by milibopp 6 years ago

milibopp commented 6 years ago

Here is a feature which I would find useful: say I have a data file that contains fields a, b, and c, and I do not care about c at all. Would it be possible to just define a struct like the following

struct Partial {
    a: f64,
    b: i8
}

and still have it parse successfully?

potocpav commented 6 years ago

I think this could be done along these lines: the responsibility for picking the right byte offsets during deserialization would fall on NpyData, not NpyRecord. NpyData would gain a new field offsets with the positions of the "useful" variables, and record_size with the total size of a single serialized record:

struct Tree<T> (Vec<(T, Tree<T>)>);

pub struct NpyData<'a, T> {
    data: &'a [u8],
    offsets: Tree<usize>,
    record_size: usize,
    n_records: usize,
    _t: PhantomData<T>,
} 
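To make the idea concrete, here is a minimal sketch of how those offsets and record_size could be derived by walking the file's dtype. This is illustrative only: it uses a flat `Vec<usize>` instead of the `Tree<usize>` above (so it ignores nested structs), and the field list and helper name are hypothetical.

```rust
/// Hypothetical helper: given the full dtype as (name, size-in-bytes) pairs
/// in file order, and the field names the user's struct declares, return the
/// byte offset of each wanted field plus the total record size.
fn subset_offsets(full: &[(&str, usize)], wanted: &[&str]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::new();
    let mut pos = 0;
    for &(name, size) in full {
        if wanted.contains(&name) {
            offsets.push(pos);
        }
        pos += size; // skipped fields still advance the cursor
    }
    (offsets, pos) // `pos` now equals the full record size
}
```

For a file with fields a: f64, b: i8, c: i32 and a struct declaring only a and b, this yields offsets [0, 8] and a record size of 13 bytes.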

NpyRecord's read function signature would need to change, and n_bytes would become superfluous:

pub trait NpyRecord : Sized {
    fn dtype() -> DType;
    fn read(bytes: &[u8], offsets: &Tree<usize>) -> Self;
    fn write<W: Write>(&self, writer: &mut W) -> Result<()>;
}
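A hand-written version of such a `read` for the Partial struct from the original question might look like the sketch below (the derive macro would generate something equivalent). Again this uses a flat offset slice rather than the proposed `Tree<usize>`, assumes little-endian data, and the function name is hypothetical.

```rust
use std::convert::TryInto;

struct Partial {
    a: f64,
    b: i8,
}

/// Hypothetical per-record reader: pick each field out of one serialized
/// record using the precomputed byte offsets, skipping the rest.
fn read_partial(bytes: &[u8], offsets: &[usize]) -> Partial {
    let a = f64::from_le_bytes(bytes[offsets[0]..offsets[0] + 8].try_into().unwrap());
    let b = bytes[offsets[1]] as i8;
    Partial { a, b }
}
```

The key point is that the record's total size never appears here: NpyData would use record_size to step from one record to the next, while read only touches the offsets it was given.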

write is fine as is; we simply wouldn't save values not mentioned in the struct.


These changes would also nearly enable implementing NpyRecord for structs with Vec fields of arbitrary length:

#[derive(NpyRecord)]
struct S {
    field: Vec<f32>,
}

We would only need to also pass shape information to the read function and somehow ensure that serialization fails on vectors of differing lengths.
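The Vec case could then be sketched as follows, assuming the element count comes from the file's shape information rather than from the Rust type. The function name and the flat (offset, len) parameters are illustrative, and little-endian data is assumed.

```rust
use std::convert::TryInto;

/// Hypothetical reader for a Vec<f32> field whose length is known only from
/// the file's dtype/shape metadata, not from the Rust type.
fn read_f32_vec(bytes: &[u8], offset: usize, len: usize) -> Vec<f32> {
    (0..len)
        .map(|i| {
            let start = offset + i * 4; // 4 bytes per f32 element
            f32::from_le_bytes(bytes[start..start + 4].try_into().unwrap())
        })
        .collect()
}
```

Serialization would need the inverse check: before writing, verify that every record's vector has the length declared in the header, and return an error otherwise.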

There may be some problems with this approach, or performance penalties. I'm not sure.

milibopp commented 6 years ago

That sounds like it adds a fair amount of complexity; I'm not sure it is worth it in that case. There is an easy workaround for the user, after all: simply define the full struct for reading and cast it to whatever subset you want.
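The workaround in code might look like this sketch: mirror the file's dtype exactly in one struct, read that, and project it onto the subset. In real npy-rs usage `Full` would derive NpyRecord and come from an actual file; here both structs are plain types for illustration.

```rust
// Mirrors the file's full dtype (in npy-rs this would #[derive(NpyRecord)]).
struct Full {
    a: f64,
    b: i8,
    c: i32, // read from the file, then discarded
}

// The subset the user actually cares about.
struct Partial {
    a: f64,
    b: i8,
}

impl From<Full> for Partial {
    fn from(f: Full) -> Partial {
        Partial { a: f.a, b: f.b }
    }
}
```

The unwanted fields are still deserialized, so this trades some throughput for a much simpler library, which is the crux of the discussion above.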

potocpav commented 6 years ago

Agreed. Since this is not easy to do and presents some performance tradeoffs, I'm closing the issue.