Closed xrl closed 5 years ago
What's really unfortunate is that cargo expand
detected rustfmt
and reformatted the entire project. I will need to clean that up somehow. What a pain 😄 (edit: I think I mostly fixed it!)
@sadikovi have you had a chance to review this work? is this anywhere close to what you had in mind when I opened #192? 😄
Your approach looks certainly interesting, I did not know that it is possible to to that with macros and parsing of structs.
As I mentioned previously, my plan was reusing Row enum that we have since it maps to pretty much anything and using triplet iterators to write data. The tricky part is specifying schema, but it could always be inferred from the data if needed.
Anyway, I might start writing some code later to see the amount of work required, but would prefer not duplicating effort if possible - will be happy to help out with writes.
@sadikovi can you provide some pseudocode for how the Row enum would work?
Good question. It would be similar to what you have, except we would use getters to extract values and write them. Plus, certain types will allow recursive (maybe) calls to writes maps, arrays and structs. I can't show the exact code to do so, but this is how I think it could be done. Not sure if it helps.
That's very interesting. I think that technique could be folded in to this code generation. I think I'm generating too much code, code which could be solved with some From
/Into
/AsRef
traits or a slightly different API. Handling all the recursive definitions is going to need some more sophistication I'm not quite equipped to write!
There's a lot of value to just putting a derive(...)
on a structure and letting the procedural macro emit the writer code for each field. I'm very interested to hear more from you.
What is your opinion on getting this work merged in to master? What do you want to see/know?
Well, I need to study your code. Usage example would be nice and more documentation around the code - will make it easy for people to follow and extend. It is a great work, though it would be appreciated if you could submit the work in chunks into master, not as one large PR - easier for people to review - but not required.
Don't have to do now, I assume there is still a lot of work to do. TLDR I need to review the code again:).
I will have a look in the next couple of days.
A usage example is in the README.md for parquet_derive
.
Similarly, I would have preferred a smaller unit of work but I didn't see an alternative path. This is the whole functionality.
Would you like to set up a time for a call to walk through the code and ask questions? After you've had time to satisfy your own regular code review methods 😄
I will satisfy one more pointer style (a_property: &'a Option<_>
). That won't change much of the existing code then it's useful for a lot of situations.
I don't see this code as being the end of RecordWriter
generation, just a first step for shallow/unnested structs. It satisfies my production code needs. In the future it can be modified or rewritten to support arbitrarily nested structs, just needs some procedural macro-fu 😄
Thanks for the effort @xrl ! I'll also take a look on this soon. :)
@xrl As I am reviewing the code can you update the PR description to mention the goal of this PR? I struggle to derive it from the code. Is it adding record writer? Thanks.
@sadikovi please post the code you are trying to derive and what errors you are seeing. Also, did you check out this branch and try running cargo test
from the parquet_derive_test
folder?
Edit: I added a Goals
, How to Use
and RecordWriter trait
section to the PR.
@sadikovi I have added more explanation to the PR text above. The How does it work?
section should help clarify!
I'm happy with the current status of this crate. I am using it in my own project and it generates reliable RecordWriter
implementations. I think this means it's good for a thorough tire kicking @sadikovi @sunchao :)
I normally put some doc to any function, but it is up to you. On Sun, 9 Dec 2018 at 10:08 PM, Xavier Lange notifications@github.com wrote:
@xrl commented on this pull request.
In parquet_derive/src/lib.rs https://github.com/sunchao/parquet-rs/pull/197#discussion_r240053311:
- "i64" => quote! { parquet::column::writer::ColumnWriter::Int64ColumnWriter },
- "f32" => quote! { parquet::column::writer::ColumnWriter::FloatColumnWriter },
- "f64" => quote! { parquet::column::writer::ColumnWriter::DoubleColumnWriter },
- o => unimplemented!("don't know {} for {:#?}", o, f),
- };
- FieldInfo {
- field_name: f.ident.clone().expect("must be a named field"),
- field_lifetime,
- field_type,
- is_option,
- column_writer_variant,
- }
- }
- pub fn to_writer_snippet(&self) -> proc_macro2::TokenStream {
This function does not need to be public, that was my mistake. Do you still want it documented?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/197#discussion_r240053311, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3szG6fk3tvjWAqqS6EsORhibvIMyks5u3XvmgaJpZM4Y9IbM .
I will be addressing this feedback soon. I'm waiting to see what happens with the switchover to the arrow project.
Thanks @xrl . Since the merge is done, could you create a JIRA in Arrow for this feature? please also paste the description there :)
@sunchao are we doing PRs against the arrow repo? or is the work still here and it gets rebased in to arrow?
@xrl From now on, we should do PRs against the arrow repo. This repo will be frozen. I can help review there.
Ah, I thought that would be the case! Thanks for the clarity.
My goals with nested types would be convert over to the Row enum (see: https://github.com/sunchao/parquet-rs/pull/197#issuecomment-443877186). Unfortunately I haven't used that API and looking over it I wasn't sure how it would work in in this case.
I'd like to learn more then revisit this crate to make it more general with row enum or any other techniques. Especially anything that makes determining definition levels easier, it's a tricky part of Parquet I don't have the strongest handle on.
This PR, as it is now, is useful in my own project. I am deriving a write for a field with 30+ fields and it has greatly improved my productivity. Hopefully that extends to other users and it shows a little sample of what is to come!
Sorry @xrl I just saw your updates. Posted some replies. I do recommend to create an Arrow JIRA for this and post the PR against the repo though, since eventually this should go there as a sub-crate.
Closing in favor of the ARROW PR!
Goal
Writing many columns to a file is a chore. If you can put your values in to a struct which mirrors the schema of your file, this
derive(ParquetRecordWriter)
will write out all the fields, in the order in which they are defined, to a row_group.How to Use
RecordWriter trait
This is the new trait which
parquet_derive
will implement for your structs.How does it work?
The
parquet_derive
crate adds code generating functionality to the rust compiler. The code generation takes rust syntax and emits additional syntax. This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, loaded by the machinery in cargo. Users don't have to do any specialbuild.rs
steps or anything like that, it's automatic by includingparquet_derive
in their project. Theparquet_derive/src/Cargo.toml
has a section saying as much:The rust struct tagged with
#[derive(ParquetRecordWriter)]
is provided to theparquet_record_writer
function inparquet_derive/src/lib.rs
. Thesyn
crate parses the struct from a string-representation to a AST (a recursive enum value). The AST contains all the values I care about when generating aRecordWriter
impl:The fields of the struct are translated from AST to a flat
FieldInfo
struct. It has the bits I care about for writing a column:field_name
,field_lifetime
,field_type
,is_option
,column_writer_variant
.The code then does the equivalent of templating to build the
RecordWriter
implementation. The templating functionality is provided by thequote
crate. At a high-level the template forRecordWriter
looks like:this template is then added under the struct definition, ending up something like:
and finally THIS is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their
struct MyValue
definition theParquetRecordWriter
will be regenerated. There's no intermediate values to version control or worry about.Viewing the Derived Code
To see the generated code before it's compiled, one very useful bit is to install
cargo expand
more info on gh, then you can do:then you can dump the contents:
now I need to write out all the combinations of types we support and make sure it writes out data.
Procedural Macros
The
parquet_derive
crate can ONLY export the derivation functionality. No traits, nothing else. The derive crate can not host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never the code.The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate,
parquet_derive_test
.I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!
Potentials For Better Design
quote!
AST generator. Or so I think. It might be nicer to push generating each columns writing code to another loop.It would be nicer if I didn't have to be so picky about data going in to the(not tackling in this generation of the plugin)write_batch
function. Is it possible we could make a version of the function which acceptInto<DataType>
or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something likewrite_generic_batch(&[impl Into<DataType>])
would be neat.Another idea to improving writing columns, could we have a write function for(not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)Iterator
s? I already have aVec<DumbRecord>
, if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec forwrite_batch
. Should have some significant memory advantages.It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.(moved to #203)Status
I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).
I think this code is worth including in the project, with the caveat that it only generates simplistic
RecordWriter
s. As people start to use we can add code generation for more complex, nested structs.Closes #192