Closed sadikovi closed 6 years ago
@sunchao I refactored column writer and page writer. I am open to any suggestions or API changes, and would be glad if you could review the code. Thanks!
While writing tests, I found that it could be challenging to create a page writer, pass it to column writer as Box and then check the content of page writer, it complains about Box having static lifetime, but the mutable reference that I pass into page writer does not live long enough. My test was going to write all of the data into page writer, then get that buffer and use it to read pages in page reader.
Would appreciate if you have any suggestions on how to solve this problem.
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
column/page.rs | 3 | 96.43% | ||
file/writer.rs | 5 | 97.67% | ||
encodings/encoding.rs | 30 | 94.82% | ||
<!-- | Total: | 38 | --> |
Totals | |
---|---|
Change from base Build 580: | -0.04% |
Covered Lines: | 11916 |
Relevant Lines: | 12489 |
I had some issues with Box
s and lifetimes in my PR when testing the code. I am thinking if it is worth revisiting all APIs (external and internal) and how we pass things between file reader/writer, row group reader/writer, column reader/writer and page reader/writer; and opening an issue on it.
@sunchao thanks! I am also working on unit tests at the moment. Will address your comments ASAP. Cheers!
@sunchao I addressed most of your review comments. Unresolved ones are about making PageWriteSpec internal, and Data Page v2 optimisations. Let me know how we need to proceed with those and would appreciate if you could have another look. Thanks!
@sunchao I updated this PR with new level encoders and added tests. Can you review again? Thanks!
I also think there are quite a few unnecessary byte copies, but I am not sure how critical those are. Would appreciate if you could also have a look at write batch and add data page methods and let me know. Thanks!
Thanks @sadikovi ! will take a look at this soon.
Honestly, I still think there is so much work to do to make this code somehow efficient... I wrote unit tests with writing into a temp file and reading it back into column reader - works so far, but it is a bit hacky :). I also added is_sorted
method to the DictEncoder, which returns false
for now.
I am still not quite sure about the actual functional differences between writer version 1 and version 2. I am treating them as data pages v1 and data pages v2, but it could be something else.
Anyway, it should be okay for now.
Thanks @sadikovi . Yes we can improve the unit tests and implement the is_sorted
logic later. Looking at the parquet-mr, I think we are doing the right way - but we can always fix it later if it doesn't :)
Sure, thanks for review and merge!
This PR adds column writer
ColumnWriter
and its typed implementationColumnWriterImpl
, which is similar to column reader in this regard. Column writer supports writing data page v1 and v2 and dictionary.PR only adds column writer and updates page writer, tests will be added in the follow-up PR, because of the amount of code to review.
I updated
PageWriter
to only have 3 methods:write_page
, writes page and returns metrics for a page inPageWriteSpec
.write_metadata
, writes column metadata into sink once.close
, closes page writer and releases resources, if any.This allows us to avoid maintaining state in page writer, all of the previous logic is moved into column writer.
The way column writer works in as follows:
write_batch
,get_total_bytes_written
,get_total_rows_written
,get_column_metadata
,close
.write_batch
is called, we split batch into smaller batches and callwrite_mini_batch
to write data, this ensures fine-grained page writes.write_mini_batch
we check if we have enough data for a page, if so, we calladd_data_page
method to create a new compressed data page. This encodes values, levels, etc. If dictionary encoding is enabled, created page is added to an in-memory queue, otherwise it is sent to page writer directly.write_mini_batch
after we added page, we check if dictionary is enabled and it exceeds the dictionary limit. If so, we create a dictionary page and write it to a page writer. Then we write all of the pages from in-memory buffer into page writer.close
, all remaining data is converted into a data page and written to page writer. If dictionary is still not full, we create dictionary page and flush all of the data pages.