Open xrl opened 5 years ago
If users write to their parquet files through an intermediate struct, we can help them out by generating the parquet schema from the struct.
I like to model rows of my parquet file using structs, for example:
struct PurchaseOrderRecord<'a> { id: i32, ad_po_number: &'a Option<String> }
and then I have to manually track the schema, writing something by hand like:
lazy_static! { static ref purchase_orders_schema: &'static str = "message schema { REQUIRED INT32 id; OPTIONAL BINARY ad_po_number (UTF8); }"; }
and any time I make a change to the PurchaseOrderRecord I have to manually update purchase_orders_schema or else I get runtime errors.
We can avoid this whole situation by providing a derive procedural macro. I was thinking of something named ParquetSchema, to be used like this:
#[derive(ParquetSchema)] struct PurchaseOrderRecord<'a> { ... }
which would derive a value and an accessor trait. With the macro fully expanded you would get something like:
struct PurchaseOrderRecord<'a> { ... } lazy_static! { static ref purchase_order_schema: parquet::schema::types::Type = ... }
What's interesting here is that I can build the concrete schema enum at compile time.
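To make the type mapping concrete, here is a minimal sketch of the logic such a derive could apply to each field. It is not a proc macro and does not use the parquet crate: `FieldInfo`, `schema_line`, and `message_schema` are hypothetical names, the field types are passed as plain strings, and only the two types from the example struct are handled. A real macro would presumably inspect the parsed type instead of matching on strings.

```rust
// Hypothetical stand-in for what a derive macro would read from each field.
// None of these names come from the parquet crate.
struct FieldInfo {
    name: &'static str,
    rust_type: &'static str,
}

/// Maps one Rust field to one line of a parquet message schema.
/// Returns `None` for types this sketch does not know how to map.
fn schema_line(field: &FieldInfo) -> Option<String> {
    // A leading reference such as `&'a ` does not change the column type.
    let ty = field.rust_type.trim_start_matches("&'a ").trim();
    // `Option<T>` becomes an OPTIONAL column; anything else is REQUIRED.
    let (repetition, inner) = match ty
        .strip_prefix("Option<")
        .and_then(|rest| rest.strip_suffix('>'))
    {
        Some(inner) => ("OPTIONAL", inner),
        None => ("REQUIRED", ty),
    };
    // Physical type plus an optional logical annotation.
    let (physical, annotation) = match inner {
        "i32" => ("INT32", ""),
        "String" => ("BINARY", " (UTF8)"),
        _ => return None,
    };
    Some(format!("{} {} {}{};", repetition, physical, field.name, annotation))
}

/// Assembles the full message schema string from all fields.
fn message_schema(fields: &[FieldInfo]) -> Option<String> {
    let lines: Option<Vec<String>> = fields.iter().map(schema_line).collect();
    Some(format!("message schema {{ {} }}", lines?.join(" ")))
}
```

Feeding it the two fields of PurchaseOrderRecord (`id` with type `i32`, `ad_po_number` with type `&'a Option<String>`) reproduces the hand-written schema string from above, which is exactly the bookkeeping the macro would take over.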
This functionality would remove error-prone manual steps when keeping writers and schemas in sync. This is a big pain point for me 😄.
The dream would be to enable functionality like:
#[derive(ParquetSchema,ParquetRecordWriter)] struct PurchaseOrderRecord<'a> { ... }
and then users can focus on their data and the parquet stuff is taken care of!
Also, I glossed over it, but we may want some kind of schema accessor trait to map a struct type to the macro-generated static schema type enum:
trait Schema { fn schema() -> &'static parquet::schema::types::Type; }
which would allow the user to access the schema anywhere with:
PurchaseOrderRecord::schema()
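As a rough illustration of how the accessor could hang together, here is a hand-written version of what the derive might expand to. This is a sketch under assumptions: `SchemaType` is a hypothetical stand-in for `parquet::schema::types::Type` so the example compiles without the parquet crate, and `std::sync::OnceLock` is used in place of `lazy_static!` to hold the static value.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for `parquet::schema::types::Type`.
#[derive(Debug, PartialEq)]
struct SchemaType {
    message: &'static str,
}

/// Accessor trait mapping a struct type to its static schema.
trait Schema {
    fn schema() -> &'static SchemaType;
}

struct PurchaseOrderRecord<'a> {
    id: i32,
    ad_po_number: &'a Option<String>,
}

// Written by hand here; the idea is that `#[derive(ParquetSchema)]`
// would generate an impl along these lines.
impl<'a> Schema for PurchaseOrderRecord<'a> {
    fn schema() -> &'static SchemaType {
        // Initialized on first access, then shared by every caller.
        static SCHEMA: OnceLock<SchemaType> = OnceLock::new();
        SCHEMA.get_or_init(|| SchemaType {
            message: "message schema { REQUIRED INT32 id; OPTIONAL BINARY ad_po_number (UTF8); }",
        })
    }
}
```

Every call to `PurchaseOrderRecord::schema()` then returns a reference to the same static value, so writers can fetch the schema wherever they need it without threading it through by hand.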