Closed gnieto closed 6 years ago
Is there a significant performance impact from boxing and some dynamic dispatching, in particular FileReader and RowGroupReader? Based on what I have seen so far, this is the only way of returning a generic value, unless you want to use enums, but I can be wrong.
It has a cost. It it's significative or not, depend on the amount of times the code is called. Usually, I use a Box and dynamic dispatching when the return type needs to be resolved at runtime (for example, see get_decoder
). On the other hand, look at FileReader::get_row_group
: It always return SerializedRowGroupReader
, so no boxing should be needed.
If going the PR path, we would not need FileReader trait at all, just use struct with generics to specify what reader uses internally.
I will try this. Thanks.
Have you considered passing input stream instance Box that could be based on a File or Cursor instead? IMHO, this could result in fewer changes overall. For this change, we would have to add more tests to make sure that FileReader
works, which is a larger undertaking than making sure that input stream behaviour works for a vector or a file.
I considered, but i prefer to make it generic. It's true that types gets more verbose. But I think that this can be simplified even more (I didn't want to introduce more changes, but I can try to simplify it a little bit more). And I think that code is (a little bit) more idiomatic by using generics instead of boxing.
Also SerializedFileWriter will have to rely on File as an input, because of certain semantics; it might result in confusion between FileWriter and FileReader, but this is a minor thing.
I don't find SerializedFileWriter
on the project. I'm not sure that I understand you.
Finally, thanks for taking time to review this. For sure, I will need to add tests, but I wanted to get early feedback and check if you are ok with it before going further.
Okay, it looks like this PR is trying to solve two problems:
Would it be possible to address them separately somehow? I would appreciate it, and it would help to clearly see what is added and why.
FYI: We are working on write support, so SerializedFileWriter
will be added as part of that work.
Yes, I can open two separate PRs. In fact, I squashed this before opening the PR, so it will easy to split this again :)
I'll open a new PR with the generics part once I've some time.
Thanks @gnieto ! Agree with @sadikovi that it's better to separate this into two PRs. I like the change on ParquetReader
since in future we'll add more input sources such as HDFS file. Will take a look at the PR for making FileReader
generic.
Generalized
FileReader
in order to allow multiple inner readers. This will allow to be able to parse from an in-memory buffer, for example (viaCursor<&[u8]>
, for example).I've also parametrized
FileReader
withPagerReader
andRowGroupReader
, which avoids boxing and some dynamic dispatching.There are some details pending to be added, but i wanted to know what do you think about this change. My goal was fuzz the crate, and I think it will be helpful that the library can parse from an arbitrary reader (this change is not enough for fuzzing, as there are some
'static
lifetimes that prevents me to fuzz it)I realize that the signatures are very verbose now, but I think that with a little more of work they can be simplified (for example, seems that
PageReader
orRowGroupReader
may be converted into structs, now that are generic to the reader).