sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Derive parquet schema from struct #203

Open xrl opened 5 years ago

xrl commented 5 years ago

If users write to their parquet files through an intermediate struct, we can help them out by generating the parquet schema from the struct.

I like to model rows of my parquet file using structs, for example:

struct PurchaseOrderRecord<'a> {
    id: i32,
    ad_po_number: &'a Option<String>
}

and then I have to manually track the schema, writing something by hand like:

lazy_static! {
    static ref purchase_orders_schema: &'static str = "message schema {
        REQUIRED INT32 id;
        OPTIONAL BINARY ad_po_number (UTF8);
    }";
}

and any time I make a change to the PurchaseOrderRecord I have to manually update purchase_orders_schema or else I get runtime errors.

We can avoid this whole situation by providing a derive procedural macro. I was thinking of something named ParquetSchema, to be used like this:

#[derive(ParquetSchema)]
struct PurchaseOrderRecord<'a> {
  ...
}

which would derive a value and an accessor trait. With the macro fully expanded you would get something like:

struct PurchaseOrderRecord<'a> {
  ...
}
lazy_static! {
  static ref purchase_order_schema: parquet::schema::types::Type = ...
}

What's interesting here is that the concrete schema enum can be built from the struct definition at compile time.
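To make the idea concrete, here is a minimal, std-only sketch of the type-to-column mapping such a derive might generate. Everything in it is an assumption for illustration: the ParquetField trait, its constants, and purchase_order_schema_text are hypothetical names, not part of parquet-rs, and the sketch produces the schema text rather than a parquet::schema::types::Type.

```rust
// Hypothetical helper trait (not part of parquet-rs): maps a Rust field type
// to the pieces of a Parquet column declaration.
trait ParquetField {
    const REPETITION: &'static str;
    const PHYSICAL: &'static str;
    const LOGICAL: Option<&'static str>;

    // Renders one column line, e.g. "REQUIRED INT32 id;".
    fn column(name: &str) -> String {
        match Self::LOGICAL {
            Some(logical) => format!(
                "{} {} {} ({});",
                Self::REPETITION,
                Self::PHYSICAL,
                name,
                logical
            ),
            None => format!("{} {} {};", Self::REPETITION, Self::PHYSICAL, name),
        }
    }
}

impl ParquetField for i32 {
    const REPETITION: &'static str = "REQUIRED";
    const PHYSICAL: &'static str = "INT32";
    const LOGICAL: Option<&'static str> = None;
}

impl ParquetField for String {
    const REPETITION: &'static str = "REQUIRED";
    const PHYSICAL: &'static str = "BINARY";
    const LOGICAL: Option<&'static str> = Some("UTF8");
}

// Option<T> keeps the inner type's physical/logical type but becomes OPTIONAL.
impl<T: ParquetField> ParquetField for Option<T> {
    const REPETITION: &'static str = "OPTIONAL";
    const PHYSICAL: &'static str = T::PHYSICAL;
    const LOGICAL: Option<&'static str> = T::LOGICAL;
}

// A reference to a field maps exactly like the referenced type.
impl<'a, T: ParquetField> ParquetField for &'a T {
    const REPETITION: &'static str = T::REPETITION;
    const PHYSICAL: &'static str = T::PHYSICAL;
    const LOGICAL: Option<&'static str> = T::LOGICAL;
}

#[allow(dead_code)]
struct PurchaseOrderRecord<'a> {
    id: i32,
    ad_po_number: &'a Option<String>,
}

// Roughly what a derive could emit for PurchaseOrderRecord: one `column`
// call per field, in declaration order (written by hand in this sketch).
fn purchase_order_schema_text() -> String {
    let columns = [
        <i32 as ParquetField>::column("id"),
        <&Option<String> as ParquetField>::column("ad_po_number"),
    ];
    format!("message schema {{\n  {}\n}}", columns.join("\n  "))
}
```

Calling purchase_order_schema_text() yields the same two column declarations as the hand-written schema string above, so adding or retyping a field would change the schema without a separate manual edit.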

This functionality would remove error-prone steps for writers/schemas. This is a big pain point for me 😄.

The dream would be to enable functionality like:

#[derive(ParquetSchema, ParquetRecordWriter)]
struct PurchaseOrderRecord<'a> {
  ...
}

and then users can focus on their data and the parquet stuff is taken care of!

Also, I glossed over it, but we may want some kind of schema accessor trait to map a struct type to the macro-generated static schema type enum:

trait Schema {
  // Trait items take no visibility modifier; they inherit the trait's visibility.
  fn schema() -> &'static parquet::schema::types::Type;
}

which would allow the user to access the schema anywhere with:

PurchaseOrderRecord::schema()
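As a rough illustration of how that accessor could be backed by a lazily initialized static, here is a std-only sketch. It assumes std::sync::OnceLock in place of lazy_static, and SchemaType is a hypothetical stand-in for parquet::schema::types::Type so that the example compiles without the parquet crate. The impl is written by hand where the macro would generate it.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for parquet::schema::types::Type, used only so this
// sketch is self-contained.
#[derive(Debug, PartialEq)]
struct SchemaType {
    name: &'static str,
    columns: Vec<&'static str>,
}

// Accessor trait mapping a struct type to its static schema value.
trait Schema {
    fn schema() -> &'static SchemaType;
}

#[allow(dead_code)]
struct PurchaseOrderRecord<'a> {
    id: i32,
    ad_po_number: &'a Option<String>,
}

// What a derive might generate: the schema is built once, on first access,
// and every later call returns the same reference.
impl<'a> Schema for PurchaseOrderRecord<'a> {
    fn schema() -> &'static SchemaType {
        static SCHEMA: OnceLock<SchemaType> = OnceLock::new();
        SCHEMA.get_or_init(|| SchemaType {
            name: "schema",
            columns: vec![
                "REQUIRED INT32 id",
                "OPTIONAL BINARY ad_po_number (UTF8)",
            ],
        })
    }
}
```

Because the static lives inside the method body, nothing extra is exposed at module level, and repeated calls to PurchaseOrderRecord::schema() share a single instance.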