sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

High Level Record Writer #192

Open xrl opened 5 years ago

xrl commented 5 years ago

This issue has been mentioned in multiple tickets, including #174. I'd like to have a tracking issue for the design of a Record Writer. I was playing around with a procedural macro design, something that could support #[derive(ParquetRecord)] or a similar name.

I'd like to support all the pointer/non-pointer values inside of the record struct: String, &String, &str, Option<&str>, Option<&String>, &Option<&str>, &Option<&String>, &Option<String>. These ownership styles came up often when loading data from Diesel: sometimes I had an owned string, sometimes I had a computed optional which yielded a borrowed string, etc.

So a sample struct:

#[derive(ParquetRecord)]
struct CoolDataForParquet<'a> {
  owned_val: String,
  borrowed_val: &'a Option<&'a str>,
  computed_borrowed_val: Option<&'a str>,
}

that would derive an implementation on the struct which writes those fields in the order they are defined: owned_val, borrowed_val, computed_borrowed_val.
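To make the intent concrete, here is a hand-written sketch of what such a derive might expand to. It is only an illustration: the shape of the ParquetRecord trait, including the column_names and values methods, is an assumption made up for this example and is not an existing parquet-rs API.

```rust
/// Hypothetical trait the derive would implement; not part of parquet-rs.
trait ParquetRecord {
    /// Column names, in field declaration order.
    fn column_names() -> Vec<&'static str>;
    /// One optional string cell per field, in the same order.
    fn values(&self) -> Vec<Option<&str>>;
}

struct CoolDataForParquet<'a> {
    owned_val: String,
    borrowed_val: &'a Option<&'a str>,
    computed_borrowed_val: Option<&'a str>,
}

// Hand-written stand-in for the code a procedural macro might generate.
impl<'a> ParquetRecord for CoolDataForParquet<'a> {
    fn column_names() -> Vec<&'static str> {
        vec!["owned_val", "borrowed_val", "computed_borrowed_val"]
    }

    fn values(&self) -> Vec<Option<&str>> {
        vec![
            // Owned string: always present, borrowed for the write.
            Some(self.owned_val.as_str()),
            // Reference to an optional: copy the inner Option<&str> out.
            *self.borrowed_val,
            // Optional borrowed string: already in the target shape.
            self.computed_borrowed_val,
        ]
    }
}
```

Every field is normalised to Option<&str>, so the writer side only has to handle one shape per column type.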

Now we need a record writer method on a RowGroup:

let records = ... // the user does all their work for this
let parquet_file = ...
let row_group = parquet_file.next_row_group().unwrap();
for record in records {
  row_group.write_record(record);
}

where RowGroup#write_record is something like:

fn write_record(&self, r: &impl ParquetRecord) {
  // pair each file column with the record value at the same position
  for (file_col, record_val) in self.columns.iter().zip(r.values()) {
    file_col.write(record_val);
  }
}

and file_col would implement the interface ColumnEasyWriter (these names are total stand-ins btw):

trait ColumnEasyWriter {
  fn write<V: ColumnEasyValue>(&self, val: V);
}

and then we build out all the implementations of ColumnEasyValue for the variations of String, &Option<&str>, etc.
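One possible way to cover the ownership variations without writing a separate impl per type is a pair of blanket impls over references and Option. This is a minimal sketch under assumptions: the as_opt_str method and the write_cell helper are invented for illustration, and only string columns are considered.

```rust
/// Hypothetical trait: anything that can be viewed as an optional string cell.
trait ColumnEasyValue {
    /// Borrow the value as `Some(&str)`, or `None` for a null cell.
    fn as_opt_str(&self) -> Option<&str>;
}

impl ColumnEasyValue for String {
    fn as_opt_str(&self) -> Option<&str> {
        Some(self.as_str())
    }
}

impl ColumnEasyValue for str {
    fn as_opt_str(&self) -> Option<&str> {
        Some(self)
    }
}

// Covers &String, &str, &Option<...> by delegating to the referenced value.
impl<T: ColumnEasyValue + ?Sized> ColumnEasyValue for &T {
    fn as_opt_str(&self) -> Option<&str> {
        (**self).as_opt_str()
    }
}

// Covers Option<String>, Option<&str>, Option<&String>; None becomes a null cell.
impl<T: ColumnEasyValue> ColumnEasyValue for Option<T> {
    fn as_opt_str(&self) -> Option<&str> {
        self.as_ref().and_then(|v| v.as_opt_str())
    }
}

/// Stand-in for a column writer: records each cell as an owned optional string.
fn write_cell<V: ColumnEasyValue>(column: &mut Vec<Option<String>>, val: V) {
    column.push(val.as_opt_str().map(str::to_owned));
}
```

With these four impls, all eight types listed earlier satisfy the bound, so write_cell accepts an owned String, a &str, or an &Option<&str> alike.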

Figuring out the responsibilities for enumerating columns, dispatching writes, and keeping the number of traits to a minimum sounds tough! This is the part I feel weakest about.
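As a rough sketch of the dispatch side, the row group could own the columns and pair them positionally with the record's values, rejecting records whose arity does not match. Everything here is a stand-in: InMemoryColumn, with_columns, and the Result<(), String> error type are assumptions for illustration, and a real column writer would encode pages rather than push into a Vec.

```rust
/// In-memory stand-in for a column writer.
#[derive(Default)]
struct InMemoryColumn {
    cells: Vec<Option<String>>,
}

impl InMemoryColumn {
    /// Append one cell; `None` represents a null.
    fn write(&mut self, val: Option<&str>) {
        self.cells.push(val.map(str::to_owned));
    }
}

struct RowGroup {
    columns: Vec<InMemoryColumn>,
}

impl RowGroup {
    /// Create a row group with `n` empty columns.
    fn with_columns(n: usize) -> Self {
        RowGroup {
            columns: (0..n).map(|_| InMemoryColumn::default()).collect(),
        }
    }

    /// Write one record, pairing each value with the column at the same position.
    fn write_record(&mut self, values: &[Option<&str>]) -> Result<(), String> {
        // Check arity first so a bad record never leaves columns out of step.
        if values.len() != self.columns.len() {
            return Err(format!(
                "expected {} values, got {}",
                self.columns.len(),
                values.len()
            ));
        }
        for (col, val) in self.columns.iter_mut().zip(values) {
            col.write(*val);
        }
        Ok(())
    }
}
```

Checking the length before writing matters because zip silently stops at the shorter side, which would otherwise leave the columns with different row counts.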

Other open questions:

xrl commented 5 years ago

@sadikovi what do you think of this design? You mentioned you were going to work on a high-level record writer, and I was curious whether this design is in line with what you wanted.

sadikovi commented 5 years ago

Thanks for writing the comment. I have been snowed under with the current project, so apologies for that.

I quite like your idea; it is a bit different from mine. I would like to see it done, as it could be a performant solution.

I was going to reuse our Row API for values, mainly because we already have it.

We can add macros to help users write less code, including file creation.

When it comes to the actual value writing, I was planning to reuse the record reader technique for writes, with triplet iterators and value readers (well, in this case, value writers). But yes, I agree, it could get complicated.
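For context, the triplet approach means each column is written as values plus definition and repetition levels. A minimal sketch of the simplest case follows, assuming a flat OPTIONAL string column (maximum definition level 1, no repetition levels). The function name to_values_and_def_levels is hypothetical and not part of the Row API.

```rust
/// Split optional cells into the dense list of non-null values and one
/// definition level per row (1 = value present, 0 = null).
fn to_values_and_def_levels<'a>(cells: &[Option<&'a str>]) -> (Vec<&'a str>, Vec<i16>) {
    let mut values = Vec::new();
    let mut def_levels = Vec::with_capacity(cells.len());
    for cell in cells {
        match cell {
            Some(v) => {
                values.push(*v);
                def_levels.push(1);
            }
            // Nulls contribute a level but no value.
            None => def_levels.push(0),
        }
    }
    (values, def_levels)
}
```

Nested or repeated fields would need higher definition levels and repetition levels as well, which is where this gets complicated.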

Let me know what you think. I will try doing at least something this weekend; I have been a bit off the project for the last two weeks.