High Level Record Writer

xrl commented 6 years ago

This issue has been mentioned in multiple tickets, including #174. I'd like to have a tracking issue for the design of a Record Writer. I was playing around with a procedural macro design, something which could support #[derive(ParquetRecord)] or a similar name.

I'd like to support all the pointer/non-pointer values inside of the record struct: String, &String, &str, Option<&str>, Option<&String>, &Option<&str>, &Option<&String>, &Option<String>. These ownership styles came up often when loading data from Diesel: sometimes I had an owned string, sometimes I had a computed optional which yielded a borrowed string, etc.

So a sample struct:

#[derive(ParquetRecord)]
struct CoolDataForParquet<'a> {
  owned_val: String,
  borrowed_val: &'a Option<&'a str>,
  computed_borrowed_val: Option<&'a str>,
}

that would derive an implementation on the struct which writes those fields in the order they are defined: owned_val, borrowed_val, computed_borrowed_val.

Now we need a record writer method on a RowGroup:

let records = ... // the user does all their work for this
let parquet_file = ...
let row_group = parquet_file.next_row_group().unwrap();
for record in records {
  row_group.write_record(record);
}

where RowGroup#write_record is something like:

fn write_record(&self, r: impl ParquetRecord) {
  // pair each file column with the record value for that column, in field order
  for (file_col, record_val) in self.columns.iter().zip(r.values()) {
    file_col.write(record_val);
  }
}

and file_col would implement the interface ColumnEasyWriter (these names are total stand-ins btw):

trait ColumnEasyWriter {
  fn write(&self, val: impl ColumnEasyValue);
}

and then we build out all the implementations of ColumnEasyValue for the variations of String, &Option<&str>, etc.
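
A minimal sketch of how those implementations could collapse into a few generic impls, assuming every string-like field can be normalized to an Option<&str> before it reaches the column writer. The method name as_opt_str is hypothetical, and ColumnEasyValue is still the stand-in name from above, not an existing parquet-rs trait.

```rust
/// Stand-in trait: anything that can be viewed as an optional UTF-8 string.
trait ColumnEasyValue {
    /// Hypothetical accessor; `None` would become a null in the column.
    fn as_opt_str(&self) -> Option<&str>;
}

// Base cases: plain string data is always present.
impl ColumnEasyValue for str {
    fn as_opt_str(&self) -> Option<&str> {
        Some(self)
    }
}

impl ColumnEasyValue for String {
    fn as_opt_str(&self) -> Option<&str> {
        Some(self.as_str())
    }
}

// Covers Option<String>, Option<&str>, Option<&String>.
impl<T: ColumnEasyValue> ColumnEasyValue for Option<T> {
    fn as_opt_str(&self) -> Option<&str> {
        self.as_ref().and_then(T::as_opt_str)
    }
}

// Covers &str, &String, &Option<&str>, &Option<String>, and so on.
impl<T: ColumnEasyValue + ?Sized> ColumnEasyValue for &T {
    fn as_opt_str(&self) -> Option<&str> {
        T::as_opt_str(*self)
    }
}
```

With the two blanket impls, all eight ownership styles listed earlier resolve through the same method, so a column writer would only need to handle Option<&str>.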

Figuring out the responsibilities for enumerating columns, dispatching writes, and keeping the number of traits to a minimum sounds tough! This is the part I feel weakest about.
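
One possible split of those responsibilities, as a rough sketch: the derived trait enumerates values in field order, and the row group zips them against its columns. Everything here is an assumption for illustration. InMemoryRowGroup and InMemoryColumn are hypothetical in-memory stand-ins rather than parquet-rs types, and the ParquetRecord impl is written by hand to show what the derive might generate.

```rust
/// Stand-in for the trait that #[derive(ParquetRecord)] would implement.
trait ParquetRecord {
    /// One value per column, in field declaration order.
    fn values(&self) -> Vec<Option<&str>>;
}

struct CoolDataForParquet<'a> {
    owned_val: String,
    borrowed_val: &'a Option<&'a str>,
    computed_borrowed_val: Option<&'a str>,
}

// Hand-written version of what the derive could expand to.
impl<'a> ParquetRecord for CoolDataForParquet<'a> {
    fn values(&self) -> Vec<Option<&str>> {
        vec![
            Some(self.owned_val.as_str()),
            *self.borrowed_val,
            self.computed_borrowed_val,
        ]
    }
}

/// Hypothetical column that buffers values in memory instead of encoding them.
#[derive(Default)]
struct InMemoryColumn {
    values: Vec<Option<String>>,
}

impl InMemoryColumn {
    fn write(&mut self, val: Option<&str>) {
        self.values.push(val.map(str::to_owned));
    }
}

/// Hypothetical row group: owns the columns and dispatches each value.
struct InMemoryRowGroup {
    columns: Vec<InMemoryColumn>,
}

impl InMemoryRowGroup {
    fn with_columns(n: usize) -> Self {
        Self {
            columns: (0..n).map(|_| InMemoryColumn::default()).collect(),
        }
    }

    fn write_record<R: ParquetRecord>(&mut self, record: &R) {
        let values = record.values();
        // A mismatch means the struct and the file schema have drifted apart.
        assert_eq!(values.len(), self.columns.len(), "record/column count mismatch");
        for (col, val) in self.columns.iter_mut().zip(values) {
            col.write(val);
        }
    }
}
```

This keeps the trait count at one for records, at the cost of an intermediate Vec per record and of only handling string columns.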

Other open questions:

[ ] Could we build the CoolDataForParquet from the schema string? Think a macro like parquet_record_writer!(schema message { REQUIRED BINARY owned_value (UTF8), ... })? Then the struct and schema are kept in sync and we get more type safety?
[ ] Could this be done with more Iterator<Item=...> kind of code? Fewer intermediate vectors could be good.
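
On the iterator question, a small sketch of what a vector-free version might look like, assuming string-only fields. The values method is hypothetical and written by hand here; a derive would have to generate the equivalent chain.

```rust
use std::iter::once;

struct CoolDataForParquet<'a> {
    owned_val: String,
    borrowed_val: &'a Option<&'a str>,
    computed_borrowed_val: Option<&'a str>,
}

impl<'a> CoolDataForParquet<'a> {
    /// Yields one value per field, in declaration order, without allocating.
    fn values(&self) -> impl Iterator<Item = Option<&str>> + '_ {
        once(Some(self.owned_val.as_str()))
            .chain(once(*self.borrowed_val))
            .chain(once(self.computed_borrowed_val))
    }
}
```

The caller could then zip this iterator directly against the row group's columns. The trade-off is that the concrete iterator type grows with the number of fields.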

xrl commented 6 years ago

@sadikovi What do you think of this design? You mentioned you were going to work on a high-level record writer, and I was curious whether this design is in line with what you wanted.

sadikovi commented 6 years ago

Thanks for writing the comment. I have been snowed under with the current project, so apologies for that.

I quite like your idea, though it is a bit different from mine. I would like to see it done, as it could be a performant solution.

I was going to reuse our Row API for values, mainly because we already have it.

We can add macros to help users write less code, including file creation.

When it comes to the actual value writing, I was planning to reuse the record reader technique for writes, with triplet iterators and value readers (well, in this case, value writers). But yes, I agree, it could get complicated.
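
To make the write side of the triplet idea concrete, here is a small sketch for the simplest case only: a flat OPTIONAL column, where the definition level is 1 for a present value and 0 for a null, and nulls store no value. The function name split_optional_column is hypothetical, and repetition levels are left out because a non-repeated column does not need them.

```rust
/// Splits one optional column into definition levels and the non-null values.
fn split_optional_column<'a>(rows: &[Option<&'a str>]) -> (Vec<i16>, Vec<&'a str>) {
    let mut def_levels = Vec::with_capacity(rows.len());
    let mut values = Vec::new();
    for row in rows {
        match row {
            Some(v) => {
                def_levels.push(1); // value is defined at the column's max level
                values.push(*v);
            }
            None => def_levels.push(0), // null: a level is recorded, no value
        }
    }
    (def_levels, values)
}
```

Nested and repeated fields would need the full level computation, which is where this is likely to get complicated.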

Let me know what you think. I will try doing at least something this weekend, I have been a bit off the project for the last two weeks.

sunchao / parquet-rs

High Level Record Writer #192