sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

parquet_derive for the new RecordWriter trait #197

Closed xrl closed 5 years ago

xrl commented 5 years ago

Goal

Writing many columns to a file is a chore. If you can put your values in to a struct which mirrors the schema of your file, this derive(ParquetRecordWriter) will write out all the fields, in the order in which they are defined, to a row_group.

How to Use

extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}

RecordWriter trait

This is the new trait which parquet_derive will implement for your structs.

use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}

How does it work?

The parquet_derive crate adds code generating functionality to the rust compiler. The code generation takes rust syntax and emits additional syntax. This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, loaded by the machinery in cargo. Users don't have to do any special build.rs steps or anything like that, it's automatic by including parquet_derive in their project. The parquet_derive/src/Cargo.toml has a section saying as much:

[lib]
proc-macro = true

The rust struct tagged with #[derive(ParquetRecordWriter)] is provided to the parquet_record_writer function in parquet_derive/src/lib.rs. The syn crate parses the struct from a string-representation to a AST (a recursive enum value). The AST contains all the values I care about when generating a RecordWriter impl:

The fields of the struct are translated from AST to a flat FieldInfo struct. It has the bits I care about for writing a column: field_name, field_lifetime, field_type, is_option, column_writer_variant.

The code then does the equivalent of templating to build the RecordWriter implementation. The templating functionality is provided by the quote crate. At a high-level the template for RecordWriter looks like:

impl RecordWriter for $struct_name {
  fn write_row_group(..) {
    $({
      $column_writer_snippet
    })
  } 
}

this template is then added under the struct definition, ending up something like:

struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_row_group(..) {
    {
       write_col_1();
    };
   {
       write_col_2();
   }
  }
}

and finally THIS is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their struct MyValue definition the ParquetRecordWriter will be regenerated. There's no intermediate values to version control or worry about.

Viewing the Derived Code

To see the generated code before it's compiled, one very useful bit is to install cargo expand more info on gh, then you can do:

$WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs

then you can dump the contents:

struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}

now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros

The parquet_derive crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate can not host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never the code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, parquet_derive_test.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design

Status

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic RecordWriters. As people start to use we can add code generation for more complex, nested structs.

Closes #192

xrl commented 5 years ago

What's really unfortunate is that cargo expand detected rustfmt and reformatted the entire project. I will need to clean that up somehow. What a pain 😄 (edit: I think I mostly fixed it!)

xrl commented 5 years ago

@sadikovi have you had a chance to review this work? is this anywhere close to what you had in mind when I opened #192? 😄

sadikovi commented 5 years ago

Your approach looks certainly interesting, I did not know that it is possible to to that with macros and parsing of structs.

As I mentioned previously, my plan was reusing Row enum that we have since it maps to pretty much anything and using triplet iterators to write data. The tricky part is specifying schema, but it could always be inferred from the data if needed.

Anyway, I might start writing some code later to see the amount of work required, but would prefer not duplicating effort if possible - will be happy to help out with writes.

xrl commented 5 years ago

@sadikovi can you provide some pseudocode for how the Row enum would work?

sadikovi commented 5 years ago

Good question. It would be similar to what you have, except we would use getters to extract values and write them. Plus, certain types will allow recursive (maybe) calls to writes maps, arrays and structs. I can't show the exact code to do so, but this is how I think it could be done. Not sure if it helps.

xrl commented 5 years ago

That's very interesting. I think that technique could be folded in to this code generation. I think I'm generating too much code, code which could be solved with some From/Into/AsRef traits or a slightly different API. Handling all the recursive definitions is going to need some more sophistication I'm not quite equipped to write!

There's a lot of value to just putting a derive(...) on a structure and letting the procedural macro emit the writer code for each field. I'm very interested to hear more from you.

What is your opinion on getting this work merged in to master? What do you want to see/know?

sadikovi commented 5 years ago

Well, I need to study your code. Usage example would be nice and more documentation around the code - will make it easy for people to follow and extend. It is a great work, though it would be appreciated if you could submit the work in chunks into master, not as one large PR - easier for people to review - but not required.

Don't have to do now, I assume there is still a lot of work to do. TLDR I need to review the code again:).

sadikovi commented 5 years ago

I will have a look in the next couple of days.

xrl commented 5 years ago

A usage example is in the README.md for parquet_derive. Similarly, I would have preferred a smaller unit of work but I didn't see an alternative path. This is the whole functionality.

Would you like to set up a time for a call to walk through the code and ask questions? After you've had time to satisfy your own regular code review methods 😄

I will satisfy one more pointer style (a_property: &'a Option<_>). That won't change much of the existing code then it's useful for a lot of situations.

I don't see this code as being the end of RecordWriter generation, just a first step for shallow/unnested structs. It satisfies my production code needs. In the future it can be modified or rewritten to support arbitrarily nested structs, just needs some procedural macro-fu 😄

sunchao commented 5 years ago

Thanks for the effort @xrl ! I'll also take a look on this soon. :)

sadikovi commented 5 years ago

@xrl As I am reviewing the code can you update the PR description to mention the goal of this PR? I struggle to derive it from the code. Is it adding record writer? Thanks.

xrl commented 5 years ago

@sadikovi please post the code you are trying to derive and what errors you are seeing. Also, did you check out this branch and try running cargo test from the parquet_derive_test folder?

Edit: I added a Goals, How to Use and RecordWriter trait section to the PR.

xrl commented 5 years ago

@sadikovi I have added more explanation to the PR text above. The How does it work? section should help clarify!

xrl commented 5 years ago

I'm happy with the current status of this crate. I am using it in my own project and it generates reliable RecordWriter implementations. I think this means it's good for a thorough tire kicking @sadikovi @sunchao :)

sadikovi commented 5 years ago

I normally put some doc to any function, but it is up to you. On Sun, 9 Dec 2018 at 10:08 PM, Xavier Lange notifications@github.com wrote:

@xrl commented on this pull request.

In parquet_derive/src/lib.rs https://github.com/sunchao/parquet-rs/pull/197#discussion_r240053311:

  • "i64" => quote! { parquet::column::writer::ColumnWriter::Int64ColumnWriter },
  • "f32" => quote! { parquet::column::writer::ColumnWriter::FloatColumnWriter },
  • "f64" => quote! { parquet::column::writer::ColumnWriter::DoubleColumnWriter },
  • o => unimplemented!("don't know {} for {:#?}", o, f),
  • };
  • FieldInfo {
  • field_name: f.ident.clone().expect("must be a named field"),
  • field_lifetime,
  • field_type,
  • is_option,
  • column_writer_variant,
  • }
  • }
  • pub fn to_writer_snippet(&self) -> proc_macro2::TokenStream {

This function does not need to be public, that was my mistake. Do you still want it documented?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/197#discussion_r240053311, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3szG6fk3tvjWAqqS6EsORhibvIMyks5u3XvmgaJpZM4Y9IbM .

xrl commented 5 years ago

I will be addressing this feedback soon. I'm waiting to see what happens with the switchover to the arrow project.

sunchao commented 5 years ago

Thanks @xrl . Since the merge is done, could you create a JIRA in Arrow for this feature? please also paste the description there :)

xrl commented 5 years ago

@sunchao are we doing PRs against the arrow repo? or is the work still here and it gets rebased in to arrow?

sunchao commented 5 years ago

@xrl From now on, we should do PRs against the arrow repo. This repo will be frozen. I can help review there.

xrl commented 5 years ago

Ah, I thought that would be the case! Thanks for the clarity.

xrl commented 5 years ago

My goals with nested types would be convert over to the Row enum (see: https://github.com/sunchao/parquet-rs/pull/197#issuecomment-443877186). Unfortunately I haven't used that API and looking over it I wasn't sure how it would work in in this case.

I'd like to learn more then revisit this crate to make it more general with row enum or any other techniques. Especially anything that makes determining definition levels easier, it's a tricky part of Parquet I don't have the strongest handle on.

This PR, as it is now, is useful in my own project. I am deriving a write for a field with 30+ fields and it has greatly improved my productivity. Hopefully that extends to other users and it shows a little sample of what is to come!

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 711


Totals Coverage Status
Change from base Build 690: 0.0%
Covered Lines:
Relevant Lines: 0

💛 - Coveralls
coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 711


Totals Coverage Status
Change from base Build 690: 0.0%
Covered Lines:
Relevant Lines: 0

💛 - Coveralls
sunchao commented 5 years ago

Sorry @xrl I just saw your updates. Posted some replies. I do recommend to create an Arrow JIRA for this and post the PR against the repo though, since eventually this should go there as a sub-crate.

xrl commented 5 years ago

Closing in favor of the ARROW PR!