sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Parquet Output Filesize #211

Open jcgomes90 opened 5 years ago

jcgomes90 commented 5 years ago

I have a program that is writing out to a parquet file. Although Parquet is a columnar storage format, I am writing row by row. I understand this takes a hit on performance, since the library allows columnar bulk writes.

My concern is the file size. When I view the parquet file using the Apache reader, everything looks fine. But opening the file in a text editor, it looks like the column title is being written for every column for every row. Is there a configuration option or something that I am missing? The files are much bigger compared to other parquet files I've seen with a lot more rows than what I have with the same schema.

jcgomes90 commented 5 years ago

The extra file size looks to come from the fact that the library writes the column header every time I write a row.

sunchao commented 5 years ago

Sorry for the late reply. Have you resolved the issue? If not, can you share the code which does the writing? You should write multiple rows in each row group.

jcgomes90 commented 5 years ago

I am reading rows as they come (real-time). My schema looks something like this:

let message_type = "message schema
{
     OPTIONAL BOOLEAN a;
     OPTIONAL INT64 b;
     OPTIONAL BOOLEAN c;
}";

I call a write_data function for every row, which looks something like this:

let mut row_group_writer = serialized_writer.next_row_group().unwrap();

while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
     match col_writer {
           ColumnWriter::Int64ColumnWriter(ref mut typed_writer) => {
                 typed_writer.write_batch(...).unwrap();
           }
           ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
                 typed_writer.write_batch(...).unwrap();
           }
           ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
                 typed_writer.write_batch(...).unwrap();
           }
           _ => {}
     }
     row_group_writer.close_column(col_writer).unwrap();
}
serialized_writer.close_row_group(row_group_writer).unwrap();

When I am done writing all the rows, I call a close_writing function which simply calls serialized_writer.close();

So essentially, I am calling that write_data function for every row of data, which looks to be adding the column headers for every row.

sunchao commented 5 years ago

Yes, it seems you are calling close_column and close_row_group for every row, which is not optimal. The latter writes the Parquet row group metadata to the file. Instead, you should keep writing and only close them once all rows (or a fixed number of rows, such as 1024) have been written.
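
For readers finding this later, here is a minimal sketch of the batching sunchao describes: buffer incoming rows in per-column vectors and flush them as one row group once a batch has accumulated, so close_column/close_row_group run once per batch instead of once per row. It is written against the Rc-based API of the standalone parquet-rs crate from around the time of this thread (later releases under apache/arrow use Arc and a reworked writer); RowBuffer, flush_row_group and BATCH_SIZE are illustrative names, not library API, and the sketch assumes every buffered value is non-null so all definition levels are 1.

use std::fs::File;
use std::rc::Rc;

use parquet::column::writer::ColumnWriter;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::{FileWriter, SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

const BATCH_SIZE: usize = 1024;

#[derive(Default)]
struct RowBuffer {
    a: Vec<bool>, // buffered values for column `a`
    b: Vec<i64>,  // buffered values for column `b`
    c: Vec<bool>, // buffered values for column `c`
}

// Write everything currently buffered as a single row group, then clear the buffer.
fn flush_row_group(writer: &mut SerializedFileWriter<File>, buf: &mut RowBuffer) {
    if buf.a.is_empty() {
        return;
    }
    // Assumption: every buffered value is non-null, so the definition level
    // is 1 for each value of these OPTIONAL columns.
    let def_levels = vec![1i16; buf.a.len()];

    let mut row_group_writer = writer.next_row_group().unwrap();
    let mut col_idx = 0;
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        match col_writer {
            ColumnWriter::BoolColumnWriter(ref mut w) => {
                // Columns 0 and 2 of this schema are the BOOLEAN columns `a` and `c`.
                let values = if col_idx == 0 { &buf.a } else { &buf.c };
                w.write_batch(values, Some(&def_levels), None).unwrap();
            }
            ColumnWriter::Int64ColumnWriter(ref mut w) => {
                w.write_batch(&buf.b, Some(&def_levels), None).unwrap();
            }
            _ => unreachable!(),
        }
        row_group_writer.close_column(col_writer).unwrap();
        col_idx += 1;
    }
    writer.close_row_group(row_group_writer).unwrap();

    buf.a.clear();
    buf.b.clear();
    buf.c.clear();
}

fn main() {
    let message_type = "message schema {
        OPTIONAL BOOLEAN a;
        OPTIONAL INT64 b;
        OPTIONAL BOOLEAN c;
    }";
    let schema = Rc::new(parse_message_type(message_type).unwrap());
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("out.parquet").unwrap();
    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();

    let mut buf = RowBuffer::default();
    // As each real-time row arrives, append it to the buffer...
    for i in 0..10_000i64 {
        buf.a.push(i % 2 == 0);
        buf.b.push(i);
        buf.c.push(i % 3 == 0);
        // ...and only open/close a row group once BATCH_SIZE rows have accumulated.
        if buf.a.len() >= BATCH_SIZE {
            flush_row_group(&mut writer, &mut buf);
        }
    }
    flush_row_group(&mut writer, &mut buf); // flush the final partial batch
    writer.close().unwrap();
}

Each call to flush_row_group emits one row group containing up to BATCH_SIZE rows, so the per-row-group metadata that was inflating the file is written once per batch rather than once per row.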

jcgomes90 commented 5 years ago

Thanks for the reply. Is there any way to get the last open row group from the row group writer? Maybe I can try closing the row group from the close_writing function.

jcgomes90 commented 5 years ago

If I am writing the parquet output row by row, it doesn't seem like it is possible to write multiple rows in one row group, since writing a row iterates through each of the columns.
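
(Note for later readers: write_batch takes a slice of values, so a single call can cover every buffered row of that column; the column iteration is fixed by the schema, but the number of rows per row group is not. A made-up illustration for the OPTIONAL INT64 column b, with three buffered rows and definition level 1 marking each value as present:)

typed_writer.write_batch(&[10i64, 20, 30], Some(&[1i16, 1, 1]), None).unwrap();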

brainstorm commented 5 years ago

Hi @jcgomes90, could you share a bit more of this parquet column-level write code? I'm about to write some code that "migrates" one parquet file into another parquet file with two extra columns, and I would like some good working examples to base my work on.

I know that row-write support is not there yet in parquet-rs, but your code seems to be the closest to it... performance is not a big issue in my case since this is a one-time transformation.

/cc @chris-zen

brainstorm commented 5 years ago

Oh, nevermind, I think I'll use parquet_derive (https://github.com/ccakes/parquet_derive) for now while https://github.com/apache/arrow/pull/4140 gets merged/worked on by @xrl, @sunchao et al 👍