sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Add column writer #138

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR adds the column writer ColumnWriter and its typed implementation ColumnWriterImpl, mirroring the structure of the column reader. The column writer supports writing data pages v1 and v2 as well as dictionary pages.

This PR only adds the column writer and updates the page writer; tests will be added in a follow-up PR to keep the amount of code to review manageable.

I updated PageWriter to only have 3 methods:

This lets us avoid maintaining state in the page writer; all of the previous logic has moved into the column writer.
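A stateless page writer along these lines could be sketched as a trait plus an in-memory implementation. The method names, signatures, and byte-slice types below are illustrative assumptions based on the description above, not the exact parquet-rs API:

```rust
// A minimal sketch of a stateless page writer, assuming three methods
// along the lines described above (write a page, write the column chunk
// metadata, close). Names and signatures are illustrative, not the
// exact parquet-rs trait.
trait PageWriter {
    /// Writes a compressed page to the sink and returns the bytes written.
    fn write_page(&mut self, page: &[u8]) -> std::io::Result<usize>;
    /// Writes the column chunk metadata.
    fn write_metadata(&mut self, metadata: &[u8]) -> std::io::Result<()>;
    /// Flushes and finalizes the underlying sink.
    fn close(&mut self) -> std::io::Result<()>;
}

/// In-memory implementation; note that it keeps no paging state of its
/// own, mirroring the refactoring described above.
struct InMemoryPageWriter {
    sink: Vec<u8>,
}

impl PageWriter for InMemoryPageWriter {
    fn write_page(&mut self, page: &[u8]) -> std::io::Result<usize> {
        self.sink.extend_from_slice(page);
        Ok(page.len())
    }

    fn write_metadata(&mut self, metadata: &[u8]) -> std::io::Result<()> {
        self.sink.extend_from_slice(metadata);
        Ok(())
    }

    fn close(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}
```

With the writer reduced to this shape, all decisions about when a page is full live in the column writer, and the page writer only moves bytes.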

The column writer works as follows:

  1. The column writer is created from a column descriptor (mainly for max definition and repetition levels), writer properties (to get various limits), and a page writer.
  2. The column writer exposes a set of public methods: write_batch, get_total_bytes_written, get_total_rows_written, get_column_metadata, close.
  3. Every time write_batch is called, we split the batch into smaller batches and call write_mini_batch to write the data; this ensures fine-grained page writes.
  4. Within write_mini_batch we check whether we have enough data for a page; if so, we call the add_data_page method to create a new compressed data page, which encodes values, levels, etc. If dictionary encoding is enabled, the created page is added to an in-memory queue; otherwise it is sent to the page writer directly.
  5. Also within write_mini_batch, after adding a page, we check whether the dictionary is enabled and exceeds the dictionary limit. If so, we create a dictionary page and write it to the page writer, then write all of the pages from the in-memory buffer to the page writer.
  6. After this dictionary fallback, all pages are written directly to the page writer.
  7. When close is called, all remaining data is converted into a data page and written to the page writer. If the dictionary is still not full, we create the dictionary page first and then flush all of the buffered data pages.
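The batching in steps 3-5 above can be sketched roughly like this. The constant, function names, and value type are assumptions for illustration; the real writer encodes values and levels rather than just counting them:

```rust
// Rough sketch of the batching described above: write_batch splits its
// input into mini batches so that page-size and dictionary-size checks
// happen at a fine granularity. All names and the batch size here are
// illustrative, not the PR's actual code.
const MINI_BATCH_SIZE: usize = 512; // assumed limit for illustration

fn write_batch(values: &[i32], values_written: &mut usize) {
    // Step 3: split the batch into smaller batches so page boundaries
    // are checked often, instead of once per (possibly huge) batch.
    for mini_batch in values.chunks(MINI_BATCH_SIZE) {
        write_mini_batch(mini_batch, values_written);
    }
}

fn write_mini_batch(mini_batch: &[i32], values_written: &mut usize) {
    // Steps 4-5 in the real writer: buffer/encode the values, create a
    // compressed data page once enough data has accumulated, and fall
    // back from dictionary encoding when the dictionary limit is hit.
    // Here we only track the running count.
    *values_written += mini_batch.len();
}
```

The point of the split is that a single large write_batch call still produces pages close to the configured page size, because the size checks run once per mini batch rather than once per user call.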

Note that we call flush_data_pages during the dictionary fallback. This method may write another page depending on the values left in the buffer, but those values are not accounted for when deciding whether the dictionary is full. So, in theory, the dictionary can end up slightly larger than the configured limit.

sadikovi commented 6 years ago

@sunchao I refactored column writer and page writer. I am open to any suggestions or API changes, and would be glad if you could review the code. Thanks!

While writing tests, I found it challenging to create a page writer, pass it to the column writer as a Box, and then check the page writer's contents: the compiler complains that the Box requires a 'static lifetime, while the mutable reference I pass into the page writer does not live long enough. My test was going to write all of the data through the page writer, then take that buffer and use it to read the pages back with a page reader.

I would appreciate any suggestions on how to solve this problem.
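One common workaround for the 'static requirement on a boxed trait object is to share ownership of the output buffer (e.g. via Rc<RefCell<...>>) instead of lending a mutable reference, so the test keeps a handle it can inspect after the writer is consumed. This is a sketch of that idea with a simplified stand-in trait, not the solution actually adopted in the PR:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Simplified stand-in for the page writer trait; the real trait has
// more methods and different signatures.
trait PageWriter {
    fn write_page(&mut self, page: &[u8]);
}

/// Page writer that appends into a shared, reference-counted buffer.
/// Because the buffer is owned via Rc rather than borrowed, the boxed
/// writer satisfies the implicit 'static bound on Box<dyn PageWriter>.
struct SharedBufferPageWriter {
    buffer: Rc<RefCell<Vec<u8>>>,
}

impl PageWriter for SharedBufferPageWriter {
    fn write_page(&mut self, page: &[u8]) {
        self.buffer.borrow_mut().extend_from_slice(page);
    }
}

/// Returns the boxed writer plus a second handle to the same buffer,
/// which the test can keep to inspect what was written.
fn shared_writer() -> (Box<dyn PageWriter>, Rc<RefCell<Vec<u8>>>) {
    let buffer = Rc::new(RefCell::new(Vec::new()));
    let writer = SharedBufferPageWriter {
        buffer: Rc::clone(&buffer),
    };
    (Box::new(writer), buffer)
}
```

The test would hand the Box to the column writer, drop or close it, and then read the accumulated bytes back out of the retained Rc handle to feed a page reader.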

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 581


Files with Coverage Reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| column/page.rs | 3 | 96.43% |
| file/writer.rs | 5 | 97.67% |
| encodings/encoding.rs | 30 | 94.82% |
| Total | 38 | |

Totals (Coverage Status):

- Change from base Build 580: -0.04%
- Covered Lines: 11916
- Relevant Lines: 12489

💛 - Coveralls
sadikovi commented 6 years ago

I ran into some issues with Boxes and lifetimes while testing the code in my PR. I am wondering whether it is worth revisiting all of the APIs (external and internal) and how we pass things between the file reader/writer, row group reader/writer, column reader/writer, and page reader/writer, and opening an issue for it.

sadikovi commented 6 years ago

@sunchao thanks! I am also working on unit tests at the moment. Will address your comments ASAP. Cheers!

sadikovi commented 6 years ago

@sunchao I addressed most of your review comments. The unresolved ones are about making PageWriteSpec internal and the Data Page v2 optimisations. Let me know how we should proceed with those, and I would appreciate it if you could have another look. Thanks!

sadikovi commented 6 years ago

@sunchao I updated this PR with new level encoders and added tests. Can you review again? Thanks!

I also think there are quite a few unnecessary byte copies, but I am not sure how critical they are. I would appreciate it if you could also take a look at the write_batch and add_data_page methods and let me know. Thanks!

sunchao commented 6 years ago

Thanks @sadikovi ! will take a look at this soon.

sadikovi commented 6 years ago

Honestly, I still think there is a lot of work to do to make this code reasonably efficient... I wrote unit tests that write into a temp file and read it back through the column reader; this works so far, but it is a bit hacky :). I also added an is_sorted method to DictEncoder, which returns false for now.

I am still not quite sure about the actual functional differences between writer version 1 and version 2. I am treating them as data pages v1 and data pages v2, but it could be something else.

Anyway, it should be okay for now.

sunchao commented 6 years ago

Thanks @sadikovi. Yes, we can improve the unit tests and implement the is_sorted logic later. Looking at parquet-mr, I think we are doing it the right way, but we can always fix it later if it turns out we are not :)

sadikovi commented 6 years ago

Sure, thanks for the review and the merge!