sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Implement Drop for RowGroupWriter, ColumnWriter, and friends #173

Open xrl opened 5 years ago

xrl commented 5 years ago

I'm learning parquet with this library and there's a lot of room for messing up when it comes to closing various resources. It would be more rust-y (idiomatic) if the library provided Drop implementations for these types.

Take this code for example:

{
    let mut pfile = create_addresses_parquet_file().unwrap();

    let mut row_group = pfile.next_row_group().unwrap();

    for p in 0..page_count {
        let res = addresses.order(ad_dsl::id.desc()).limit(page_size).offset(p * page_size).load::<Address>(&conn).unwrap();
        println!("got a page {}", p);

        let mut column_writer = row_group.next_column().unwrap().unwrap();
        let addr_ids = res.iter().map(|x| x.id).collect::<Vec<i32>>();
        if let ColumnWriter::Int32ColumnWriter(ref mut typed) = column_writer {
            typed.write_batch(&addr_ids[..], None, None).unwrap();
        }
        row_group.close_column(column_writer).unwrap();
    }

    pfile.close_row_group(row_group).unwrap();
}

There are two ways I can mess up (and indeed did): forgetting to call close_column or forgetting to call close_row_group.

The std::ops::Drop trait could be rigged up to call those close functions, if we can keep track of the parent object somehow. The trait definition:

pub trait Drop {
    fn drop(&mut self);
}

The compiler will make sure this destructor is called in reverse order of value creation, so we can be sure that a column writer is closed before its row group, etc.
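
As a standalone illustration of that drop ordering (the Loud type below is made up purely for this example):

struct Loud(&'static str);

impl Drop for Loud {
    fn drop(&mut self) {
        println!("dropping {}", self.0);
    }
}

fn main() {
    let _row_group = Loud("row group");
    let _column = Loud("column");
    // Locals drop in reverse declaration order, so this prints
    // "dropping column" before "dropping row group".
}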

Hopefully the code would someday turn out simpler:

{
    let mut pfile = create_addresses_parquet_file().unwrap();

    let mut row_group = pfile.next_row_group().unwrap();

    for p in 0..page_count {
        let res = addresses.order(ad_dsl::id.desc()).limit(page_size).offset(p * page_size).load::<Address>(&conn).unwrap();
        println!("got a page {}", p);

        let mut column_writer = row_group.next_column().unwrap().unwrap();
        let addr_ids = res.iter().map(|x| x.id).collect::<Vec<i32>>();
        if let ColumnWriter::Int32ColumnWriter(ref mut typed) = column_writer {
            typed.write_batch(&addr_ids[..], None, None).unwrap();
        }
        // std::ops::Drop::drop(column_writer) called by compiler
    }

    // std::ops::Drop::drop(row_group) called by compiler
}
sadikovi commented 5 years ago

Have you read the documentation? Was it clear how to use column writers and row group writers?

From the trait's documentation: "Because of this recursive dropping, you do not need to implement this trait unless your type needs its own destructor logic." Even having implemented this trait, you can still close out of order, so it does not necessarily solve the problem. The major issue is keeping track of a row group writer or a file writer, but we were trying to avoid that by design. And, if I remember correctly, we had a bunch of checks to prevent out-of-order writes.

IMHO, the better approach would be our first design, where we returned a mutable reference to the column writer (and similarly for the row group writer). That allowed us to keep track of what was being written, close row groups and column writers automatically, and enforce the order - but we decided to go with the current implementation. You can find more details on the PR and issue.

Column reader and column writer APIs are considered low-level, e.g. they do not handle def/rep levels for you, and it is up to you to follow the docs. Compare that to our record reader API, which is idiomatic Rust.
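
To make "do not handle def/rep levels for you" concrete, here is a sketch of writing a nullable INT32 column by hand (the values are illustrative; the max definition level is 1 for a top-level optional field):

// Only the non-null values go into the values slice; one definition
// level per logical row records whether the value is present (1) or null (0).
let values = vec![1, 3];
let def_levels = vec![1i16, 0, 1]; // rows: Some(1), None, Some(3)
if let ColumnWriter::Int32ColumnWriter(ref mut typed) = column_writer {
    typed.write_batch(&values, Some(&def_levels), None).unwrap();
}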

I think we just need to make it clear in our documentation.

xrl commented 5 years ago

I am very new to parquet and I'm still working through the core concepts of row group writers (columns I understand, but I guess columns aren't necessarily contiguous? they go through pages/chunks?). I was happy to just write a file with INT32 IDs in it!

Now I am moving on to more complex values and it does seem very error prone. I understand it's a low level writer and there's room for operator error.

I did see there was a record reader API, which looks very friendly to use, but I don't see any similar API for a record writer. I think you are right: a mutable reference, where you have to request a new row group writer/column writer, would be more idiomatic Rust; the compiler would enforce that there's only ever one mutable borrow of the file.
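
For reference, the record reader loop looks roughly like this (a sketch based on the crate's reader API; the file path is a placeholder):

use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};

let file = File::open(&Path::new("data.parquet")).unwrap();
let reader = SerializedFileReader::new(file).unwrap();
let mut iter = reader.get_row_iter(None).unwrap();
while let Some(record) = iter.next() {
    println!("{}", record); // each record is a fully assembled row
}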

The documentation didn't have many annotated examples of writing data, so I've been peeking through test cases to figure out more. As I learn more I hope to contribute those back!

xrl commented 5 years ago

I think this is the relevant issue/pr pair:

https://github.com/sunchao/parquet-rs/issues/116 https://github.com/sunchao/parquet-rs/pull/149

sadikovi commented 5 years ago

Yes, you are right - there is room for improvement in the parquet writer. I have been snowed under recently, but will try to get back to improving things in parquet, in particular working on tools to convert csv/json tables into parquet (no guarantees though!).

A parquet page is the smallest self-contained unit of data that you can read; it has all of the encoding information, compression information, byte data, and definition and repetition levels. There are different types of pages: data, dictionary, index, etc.

Each column chunk is a set of pages that has to be read contiguously and represents a leaf column in the parquet schema. All of the nesting and nulls are encoded with repetition and definition levels, respectively.

Normally a row group will contain one column chunk per leaf node of the parquet schema. It is a requirement that all column chunks are written sequentially, one column after another. Meanwhile, metadata is recorded and stored as part of each column chunk and row group, which allows column pruning and statistics-based filtering.

Each file can contain some number of row groups, including 0 row groups (an empty file). Normally query engines parallelise work across files on row group boundaries, with the smallest unit of reading being a column chunk, since readers create an iterator per column.
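
A sketch of how a reader walks that structure (method names follow this crate's FileReader/RowGroupReader API as I understand it; error handling elided):

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

let file = File::open("data.parquet").unwrap();
let reader = SerializedFileReader::new(file).unwrap();
for i in 0..reader.num_row_groups() {
    let row_group = reader.get_row_group(i).unwrap();
    // one column chunk per leaf column in the schema
    for j in 0..row_group.num_columns() {
        let _column_reader = row_group.get_column_reader(j).unwrap();
        // pages inside the chunk are decoded by the column reader
    }
}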

Users are supposed to handle complex values like lists, maps and structs themselves using the current writer API. The writer API was added only recently and simply has not had much usability testing; I wrote as much documentation as I could at the time.
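
As an example of what handling a complex value manually involves, here is a sketch of encoding the rows [[1, 2], [3]] for a standard three-level LIST column (the levels are my own worked example, not taken from the crate's docs):

// Schema sketch:
//   optional group my_list (LIST) {   // def level 1 when the list is present
//     repeated group list {           // def level 2, rep level 1 per element
//       required int32 element;
//     }
//   }
let values = vec![1, 2, 3];          // flattened leaf values
let def_levels = vec![2i16, 2, 2];   // every element is fully defined
let rep_levels = vec![0i16, 1, 0];   // 0 starts a new row, 1 continues a list
if let ColumnWriter::Int32ColumnWriter(ref mut typed) = column_writer {
    typed.write_batch(&values, Some(&def_levels), Some(&rep_levels)).unwrap();
}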

It would be great if you could have a look, see if there are any problems or missing docs in the code, and help fix them!

sadikovi commented 5 years ago

Actually, the more I think about this, the more I tend to agree that your approach is the best and removes the necessity to close writers. We should implement Drop for all of those types, because it is quite annoying to call close every time. I will try it out and open a PR for it.

xrl commented 5 years ago

It'll be interesting to see what you come up with. I'm concerned about your initial, and correct, observation: you can still close out of order. Perhaps closing the individual column writers could be left up to the RowGroup?

{
  row_group.next_column() // pushes the column onto a stack of "to be closed" references
  // do stuff with column
} // row_group is dropped, closes referenced columns in reverse order of allocation,
  // and panics(?) if any columns were not handled?
  // or perhaps it just lets users write corrupt files?
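
A hypothetical sketch of that design (none of these types or fields exist in the crate; it only shows the shape of the idea):

struct ColumnHandle;

impl ColumnHandle {
    fn close(&mut self) -> Result<(), String> {
        Ok(()) // flush pages, record chunk metadata, etc.
    }
}

struct RowGroupWriter {
    open_columns: Vec<ColumnHandle>, // the "to be closed" stack
}

impl Drop for RowGroupWriter {
    fn drop(&mut self) {
        // Columns close in reverse order of allocation. Drop returns (),
        // so an error here can only panic or be silently ignored.
        while let Some(mut column) = self.open_columns.pop() {
            if column.close().is_err() {
                // panic!()? or quietly allow a corrupt file?
            }
        }
    }
}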
sadikovi commented 5 years ago

I think we will just throw an error when this happens. I will try and see what I can do. Thanks!

xrl commented 5 years ago

I don’t think Drop can return an error. I think the return type is ().

sadikovi commented 5 years ago

That is not what I meant; I know it returns unit. I was thinking about simply panicking when one of the operations is not performed correctly, in order to at least indicate an error somehow. But that results in some weird behaviour, and an expected error would always turn into a panic.

It does look like moving to &mut might be the best solution yet. Closing out of order can be solved relatively easily by a (global) id counter, so you would always know which column writer/row group writer should be closed next.
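
A minimal sketch of that id-counter check (hypothetical names; how to report the error is the open question above):

// Writers must close in LIFO order; the checker tracks open ids on a stack.
struct OrderChecker {
    next_id: u64,
    open: Vec<u64>,
}

impl OrderChecker {
    fn open_writer(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.open.push(id);
        id
    }

    fn close_writer(&mut self, id: u64) -> Result<(), String> {
        // Only the most recently opened writer may close next.
        match self.open.last() {
            Some(&top) if top == id => {
                self.open.pop();
                Ok(())
            }
            _ => Err(format!("writer {} closed out of order", id)),
        }
    }
}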

Let's keep this open until someone comes up with an idea, or until it simply stops being a problem once we build a high-level API.

sunchao commented 5 years ago

Thanks for the discussion. I think my original concern is that, if we return &mut, the returned row group writer can only be used within the same call stack as the file writer, which is a limitation. I agree that changing it to &mut can simplify things though. We can explore this idea.

With &mut, instead of having a drop, can we just close the previous writer when calling next_row_group? I believe this would solve most of the issues @xrl encountered. We can still provide a close method for file/row group/column etc., if people need to close them explicitly.
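
A hypothetical sketch of that implicit-close behaviour (types and methods invented for illustration):

struct RowGroup;

impl RowGroup {
    fn close(&mut self) { /* flush column chunks and metadata */ }
}

struct FileWriter {
    current: Option<RowGroup>,
}

impl FileWriter {
    fn next_row_group(&mut self) -> &mut RowGroup {
        // Implicitly finalize the previous row group before starting a new one.
        if let Some(ref mut prev) = self.current {
            prev.close();
        }
        self.current = Some(RowGroup);
        self.current.as_mut().unwrap()
    }
}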

sadikovi commented 5 years ago

We already have that, in a way. I suggest we keep this open; I will keep thinking about how to solve it.

xrl commented 5 years ago

@sunchao can you explain further why it's useful for the row group writer and file writer to be used in different call stacks? Does that mean you want to send row group writers to different threads?

sunchao commented 5 years ago

@xrl I don't have a concrete use case for now - I just think people may want the flexibility to store the writers on the heap. I'm pretty OK with changing to &mut as long as it makes things easier. I haven't thought about the scenario of concurrent row group/page writers, though I think either approach can support that use case.

xrl commented 5 years ago

I think it will look more "rust-y" to have the &mut, and there will be less boilerplate.

If users did want to put some part of the writer on the heap, I think it'd be best to just put the whole file handle there. Otherwise the parquet library would need Rc/Arc in order to track when those writers are done. But I don't understand why a user would want just one part of the writer on the heap and not the whole file.

sunchao commented 5 years ago

Yes, it looks more rusty that way and is potentially more performant, though I don't understand the second paragraph you wrote. Can you elaborate on that?

xrl commented 5 years ago

My comments about Rc/Arc are probably nonsense :)

I think we can support people putting their values in a Box.
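
A minimal sketch of that point (hypothetical types; it just shows that boxing the owning writer composes fine with a &mut-based API, so the library itself needs no Rc/Arc):

struct FileWriter;
struct RowGroup;

impl FileWriter {
    fn next_row_group(&mut self) -> RowGroup {
        RowGroup
    }
}

fn main() {
    // The whole writer lives on the heap; &mut methods work through the Box.
    let mut writer: Box<FileWriter> = Box::new(FileWriter);
    let _row_group = writer.next_row_group();
}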