sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Generic reader #142

Closed gnieto closed 6 years ago

gnieto commented 6 years ago

Generalized FileReader in order to allow multiple inner readers. This will allow to be able to parse from an in-memory buffer, for example (via Cursor<&[u8]>, for example).

Note that there are still some issues because of 'static lifetimes (I guess that it's still due to some type referencing DataType), which limits a little bit how it may be used. For example, this example does not work at the moment:

    let v = vec![0, 1, 2, 3];
    let c = Cursor::new(v.as_ref());
    let serialized = SerializedFileReader::new(c).unwrap();

    // compiler refuses to compile due to `v does not life enough`
    let iterator = serialized.get_row_iter(None).unwrap();

I've had to add some generic annotations. If you don't like it, I can Box it (but i think that it adds some usability issues).

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 583


Files with Coverage Reduction New Missed Lines %
errors.rs 1 26.67%
util/io.rs 1 98.92%
file/reader.rs 14 96.96%
<!-- Total: 16 -->
Totals Coverage Status
Change from base Build 580: 0.1%
Covered Lines: 11426
Relevant Lines: 11957

💛 - Coveralls
sadikovi commented 6 years ago

With all of these static problems, is it better going bottom up instead, e.g. from changing data types ('static problem) and internals to the parquet reader, and, in some not very distant future, writer. Feel free to open an issue if you think scope is large.

Looks good! Thanks for working on this.

gnieto commented 6 years ago

@sadikovi Do you mean to do a PR on your branch? If this gets merged, I can try to do it.

sunchao commented 6 years ago

@gnieto : sorry for the late review. This PR looks good to me.

For the lifetime issue, we can try to look into that but I think it may come with a cost. Loosing the static lifetime constraint may mean that we need to populate the lifetime parameter in many places, and reasoning about lifetime is not modular in Rust, so it's pretty tedious. In addition to that, it can also restrict expressive in another way (e.g., you may no longer store things as fields since the lifetime is bounded by the life of the stack).

IMO we should evaluate the trade off. For this case in particular, I'm not quite clear how useful it would be to be able to pass borrowed reference into Cursor. Also, we can use Rc in case that's really needed.

sunchao commented 6 years ago

Merged. Thanks @gnieto !