Closed sadikovi closed 6 years ago
@sunchao Could you review this PR when you have time. I have several high-level questions and workarounds that I would like to discuss below:
close
takes &mut self
now instead of mut self
. My goal was mainly around the fact that I can store a reference to a current column writer in row group writer and call close()
multiple times (user might call close, and we need to ensure that also close column writer in case she did not). Is it okay or should I look for other ways of implementing it somehow?DEFAULT_MAX_ROW_GROUP_SIZE
from WriterProperties
. Should we use it? Let me know what you think. Thanks!
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
column/writer.rs | 3 | 95.93% | ||
file/reader.rs | 13 | 96.97% | ||
file/writer.rs | 20 | 95.91% | ||
<!-- | Total: | 36 | --> |
Totals | |
---|---|
Change from base Build 610: | 0.1% |
Covered Lines: | 12359 |
Relevant Lines: | 12927 |
@sadikovi : really sorry that I haven't got chance to look at this yet. Will take a look today.
No worries, take as much time as needed. On Tue, 21 Aug 2018 at 6:37 PM, Chao Sun notifications@github.com wrote:
@sadikovi https://github.com/sadikovi : really sorry that I haven't got chance to look at this yet. Will take a look today.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/149#issuecomment-414739703, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3vuyRRb6CI-LUG8TIF2iTgpCybYVks5uTDdegaJpZM4V_Uzh .
@sunchao I addressed your comments, but might have missed some. Would you mind doing another review pass, including tests (I feel they are a bit overcomplicated)? And could you also give your thoughts on the questions I raised in my first comment?
Thanks!
@sunchao I made the changes to the API as per your suggestion. Can you have a look? Thanks!
Now we have next_XYZ
and close_XYZ
, we will raise an error if next method is called before close method is called for the previous XYZ.
Let me know if I should add more unit tests.
@sadikovi : the latest PR looks good to me in general. However, the test test_column_writer_check_metadata
is failing. Can you fix that?
Yes, thanks. I just checked the tests, will fix shortly, make some other minor changes and ping you again. On Fri, 31 Aug 2018 at 10:46 AM, Chao Sun notifications@github.com wrote:
@sadikovi https://github.com/sadikovi : the latest PR looks good to me in general. However, the test test_column_writer_check_metadata is failing. Can you fix that?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/149#issuecomment-417597149, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3uJ5jEm7WWh0bZ7u1q2WaJObELlLks5uWPfXgaJpZM4V_Uzh .
@sunchao I fixed the test - it failed because I added test on encodings in this branch, but in master we have updated to add levels encoding as well, so it was failing with missing encoding for levels.
Should be all good now, would appreciate if you could look over the code again. There is also an open question about limiting row groups somehow, either based on size or number of rows - currently we do not limit them in any way. We could also add this as a follow-up issue.
Should be all good now, would appreciate if you could look over the code again. There is also an open question about limiting row groups somehow, either based on size or number of rows - currently we do not limit them in any way. We could also add this as a follow-up issue.
Yes we should have some config properties for this. Good for a follow-up. :)
Merged. Thanks @sadikovi !
Thanks! On Fri, 31 Aug 2018 at 8:48 PM, Chao Sun notifications@github.com wrote:
Merged. Thanks @sadikovi https://github.com/sadikovi !
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/149#issuecomment-417757558, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3s1WzMAaaSJdyY6K9K4a31reIExWks5uWYTtgaJpZM4V_Uzh .
This PR is final part of the core implementation of Parquet writes, adds:
FileWriter
and its serialised versionSerializedFileWriter
.RowGroupWriter
and its serialised versionSerializedRowGroupWriter
.I tried to maintain a fairly simple API. The idea and steps are as follows:
std::fs::File
, schemaTypePtr
and writer propertiesWriterPropertiesPtr
that you want to write.file_writer.next_row_group
to get a mutable reference to a row group writer to add data.row_group_writer.next_column
to add values to a column. Keep iterating until all columns have been written - we do not support partially written row group!close()
on column writer and/or row group writer. They will be closed regardless, before either the next column is requested or the next row group is requested.file_writer.close
to write final metadata to a file.I added a set of unit tests. Please review them as well, they show API in action.