sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Add file writer and row group writer #149

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR is final part of the core implementation of Parquet writes, adds:

I tried to maintain a fairly simple API. The idea and steps are as follows:

  1. Create file writer based on file std::fs::File, schema TypePtr and writer properties WriterPropertiesPtr that you want to write.
  2. Call file_writer.next_row_group to get a mutable reference to a row group writer to add data.
  3. Call row_group_writer.next_column to add values to a column. Keep iterating until all columns have been written - we do not support partially written row group!
  4. After that you can call close() on column writer and/or row group writer. They will be closed regardless, before either the next column is requested or the next row group is requested.
  5. Keep adding row groups until you have added enough.
  6. Call file_writer.close to write final metadata to a file.
  7. That is it, after that you should have a complete file, so you can check schema and read records from it.

I added a set of unit tests. Please review them as well, they show API in action.

sadikovi commented 6 years ago

@sunchao Could you review this PR when you have time. I have several high-level questions and workarounds that I would like to discuss below:

Let me know what you think. Thanks!

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 612


Files with Coverage Reduction New Missed Lines %
column/writer.rs 3 95.93%
file/reader.rs 13 96.97%
file/writer.rs 20 95.91%
<!-- Total: 36 -->
Totals Coverage Status
Change from base Build 610: 0.1%
Covered Lines: 12359
Relevant Lines: 12927

💛 - Coveralls
sunchao commented 6 years ago

@sadikovi : really sorry that I haven't got chance to look at this yet. Will take a look today.

sadikovi commented 6 years ago

No worries, take as much time as needed. On Tue, 21 Aug 2018 at 6:37 PM, Chao Sun notifications@github.com wrote:

@sadikovi https://github.com/sadikovi : really sorry that I haven't got chance to look at this yet. Will take a look today.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/149#issuecomment-414739703, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3vuyRRb6CI-LUG8TIF2iTgpCybYVks5uTDdegaJpZM4V_Uzh .

sadikovi commented 6 years ago

@sunchao I addressed your comments, but might have missed some. Would you mind doing another review pass, including tests (I feel they are a bit overcomplicated)? And could you also give your thoughts on the questions I raised in my first comment?

Thanks!

sadikovi commented 6 years ago

@sunchao I made the changes to the API as per your suggestion. Can you have a look? Thanks!

Now we have next_XYZ and close_XYZ, we will raise an error if next method is called before close method is called for the previous XYZ.

Let me know if I should add more unit tests.

sunchao commented 6 years ago

@sadikovi : the latest PR looks good to me in general. However, the test test_column_writer_check_metadata is failing. Can you fix that?

sadikovi commented 6 years ago

Yes, thanks. I just checked the tests, will fix shortly, make some other minor changes and ping you again. On Fri, 31 Aug 2018 at 10:46 AM, Chao Sun notifications@github.com wrote:

@sadikovi https://github.com/sadikovi : the latest PR looks good to me in general. However, the test test_column_writer_check_metadata is failing. Can you fix that?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/149#issuecomment-417597149, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3uJ5jEm7WWh0bZ7u1q2WaJObELlLks5uWPfXgaJpZM4V_Uzh .

sadikovi commented 6 years ago

@sunchao I fixed the test - it failed because I added test on encodings in this branch, but in master we have updated to add levels encoding as well, so it was failing with missing encoding for levels.

Should be all good now, would appreciate if you could look over the code again. There is also an open question about limiting row groups somehow, either based on size or number of rows - currently we do not limit them in any way. We could also add this as a follow-up issue.

sunchao commented 6 years ago

Should be all good now, would appreciate if you could look over the code again. There is also an open question about limiting row groups somehow, either based on size or number of rows - currently we do not limit them in any way. We could also add this as a follow-up issue.

Yes we should have some config properties for this. Good for a follow-up. :)

sunchao commented 6 years ago

Merged. Thanks @sadikovi !

sadikovi commented 6 years ago

Thanks! On Fri, 31 Aug 2018 at 8:48 PM, Chao Sun notifications@github.com wrote:

Merged. Thanks @sadikovi https://github.com/sadikovi !

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/149#issuecomment-417757558, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3s1WzMAaaSJdyY6K9K4a31reIExWks5uWYTtgaJpZM4V_Uzh .