sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Make SerializedFileReader generic to any Reader #135

Closed gnieto closed 6 years ago

gnieto commented 6 years ago

Generalized FileReader in order to allow multiple inner readers. This will allow to be able to parse from an in-memory buffer, for example (via Cursor<&[u8]>, for example).

I've also parametrized FileReader with PagerReader and RowGroupReader, which avoids boxing and some dynamic dispatching.

There are some details pending to be added, but i wanted to know what do you think about this change. My goal was fuzz the crate, and I think it will be helpful that the library can parse from an arbitrary reader (this change is not enough for fuzzing, as there are some 'static lifetimes that prevents me to fuzz it)

I realize that the signatures are very verbose now, but I think that with a little more of work they can be simplified (for example, seems that PageReader or RowGroupReader may be converted into structs, now that are generic to the reader).

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 567


Totals Coverage Status
Change from base Build 564: 0.0%
Covered Lines: 11364
Relevant Lines: 11921

💛 - Coveralls
gnieto commented 6 years ago

Is there a significant performance impact from boxing and some dynamic dispatching, in particular FileReader and RowGroupReader? Based on what I have seen so far, this is the only way of returning a generic value, unless you want to use enums, but I can be wrong.

It has a cost. It it's significative or not, depend on the amount of times the code is called. Usually, I use a Box and dynamic dispatching when the return type needs to be resolved at runtime (for example, see get_decoder). On the other hand, look at FileReader::get_row_group: It always return SerializedRowGroupReader, so no boxing should be needed.

If going the PR path, we would not need FileReader trait at all, just use struct with generics to specify what reader uses internally.

I will try this. Thanks.

Have you considered passing input stream instance Box that could be based on a File or Cursor instead? IMHO, this could result in fewer changes overall. For this change, we would have to add more tests to make sure that FileReader works, which is a larger undertaking than making sure that input stream behaviour works for a vector or a file.

I considered, but i prefer to make it generic. It's true that types gets more verbose. But I think that this can be simplified even more (I didn't want to introduce more changes, but I can try to simplify it a little bit more). And I think that code is (a little bit) more idiomatic by using generics instead of boxing.

Also SerializedFileWriter will have to rely on File as an input, because of certain semantics; it might result in confusion between FileWriter and FileReader, but this is a minor thing.

I don't find SerializedFileWriter on the project. I'm not sure that I understand you.


Finally, thanks for taking time to review this. For sure, I will need to add tests, but I wanted to get early feedback and check if you are ok with it before going further.

sadikovi commented 6 years ago

Okay, it looks like this PR is trying to solve two problems:

Would it be possible to address them separately somehow? I would appreciate it, and it would help to clearly see what is added and why.

FYI: We are working on write support, so SerializedFileWriter will be added as part of that work.

gnieto commented 6 years ago

Yes, I can open two separate PRs. In fact, I squashed this before opening the PR, so it will easy to split this again :)

I'll open a new PR with the generics part once I've some time.

sunchao commented 6 years ago

Thanks @gnieto ! Agree with @sadikovi that it's better to separate this into two PRs. I like the change on ParquetReader since in future we'll add more input sources such as HDFS file. Will take a look at the PR for making FileReader generic.