sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Add support for reading columns as Apache Arrow arrays #79

Open andygrove opened 6 years ago

andygrove commented 6 years ago

This is going to be easy. I have some ugly prototype code working already.

    // Imports assumed for this prototype (early parquet-rs / pre-donation
    // arrow crate layout):
    //   use std::{fs::File, mem, path::Path, slice};
    //   use parquet::column::reader::ColumnReader;
    //   use parquet::file::reader::{FileReader, SerializedFileReader};
    //   use arrow::{array::{Array, ArrayData}, buffer::Buffer, memory};

    let path = Path::new(&args[1]);
    let file = File::open(&path).unwrap();
    let parquet_reader = SerializedFileReader::new(file).unwrap();

    let row_group_reader = parquet_reader.get_row_group(0).unwrap();

    for i in 0..row_group_reader.num_columns() {
        match row_group_reader.get_column_reader(i) {
            Ok(ColumnReader::Int32ColumnReader(ref mut r)) => {
                // Allocate an aligned buffer and view it as &mut [i32] so that
                // read_batch can decode directly into it.
                let batch_size = 1024;
                let sz = mem::size_of::<i32>();
                let p = memory::allocate_aligned((batch_size * sz) as i64).unwrap();
                let ptr_i32 = p as *mut i32;
                let mut buf = unsafe { slice::from_raw_parts_mut(ptr_i32, batch_size) };

                match r.read_batch(batch_size, None, None, &mut buf) {
                    Ok((count, _)) => {
                        // Hand the decoded values to Arrow without copying.
                        let arrow_buffer = Buffer::from_raw_parts(ptr_i32, count as i32);
                        let arrow_array = Array::from(arrow_buffer);

                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            }
                            _ => println!("wrong type"),
                        }
                    }
                    _ => println!("error"),
                }
            }
            _ => println!("column type not supported"),
        }
    }

Now we need to come up with a real design and I probably need to add more helper methods to Arrow to make this easier.
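
A strawman for such a helper could hide the allocate/read/truncate dance behind one call. This is only a sketch under assumed names (read_i32_batch is hypothetical, not an existing parquet-rs or arrow API), and it uses a plain Vec where a real design would fill an Arrow buffer:

    use parquet::column::reader::ColumnReader;

    // Hypothetical helper: read one batch of a required i32 column into a Vec,
    // truncated to the number of values actually read.
    fn read_i32_batch(reader: &mut ColumnReader, batch_size: usize) -> Option<Vec<i32>> {
        if let ColumnReader::Int32ColumnReader(ref mut r) = *reader {
            let mut values = vec![0i32; batch_size];
            // read_batch returns (values_read, levels_read)
            let (count, _) = r.read_batch(batch_size, None, None, &mut values).ok()?;
            values.truncate(count);
            Some(values)
        } else {
            None
        }
    }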

andygrove commented 6 years ago

Here is a cleaned up version:

                let batch_size = 1024;
                let mut builder: Builder<i32> = Builder::with_capacity(batch_size);
                match r.read_batch(batch_size, None, None, builder.slice_mut(0, batch_size)) {
                    Ok((count, _)) => {
                        // Trim the builder to the values actually read, then
                        // hand the buffer to Arrow without copying.
                        builder.set_len(count);
                        let arrow_array = Array::from(builder.finish());
                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            }
                            _ => println!("wrong type"),
                        }
                    }
                    _ => println!("error"),
                }

andygrove commented 6 years ago

I've made good progress integrating parquet-rs with DataFusion. I have examples like this working:

    let mut ctx = ExecutionContext::local();

    let df = ctx.load_parquet("test/data/uk_cities.parquet/part-00000-bdf0c245-d300-4b28-bfdd-0f1f9cb898c4-c000.snappy.parquet").unwrap();

    ctx.register("uk_cities", df);

    // define the SQL statement
    let sql = "SELECT lat, lng FROM uk_cities";

    // create a data frame
    let df = ctx.sql(sql).unwrap();

    df.show(10);

So far it only works for int32 and f32 columns, though.

sunchao commented 6 years ago

Nice progress @andygrove! Really glad you can now read Parquet using SQL.

sadikovi commented 6 years ago

Looks great! I am curious how Arrow handles/maps optional or repeated fields (when definition and/or repetition levels exist).
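
For a required column the value buffer maps one-to-one onto the array, but for an optional column the definition levels mark which slots are null, and parquet-rs packs the present values densely at the front of the values buffer. A rough, hypothetical sketch of that expansion for an optional i32 column with max definition level 1, using Option<i32> slots where Arrow would use a validity bitmap:

    use parquet::column::reader::ColumnReader;
    use parquet::errors::Result;

    // Hypothetical sketch: expand one batch of an optional i32 column
    // (max definition level 1) into nullable slots.
    fn read_optional_i32(reader: &mut ColumnReader, batch_size: usize) -> Result<Vec<Option<i32>>> {
        let mut out = Vec::with_capacity(batch_size);
        if let ColumnReader::Int32ColumnReader(ref mut r) = *reader {
            let mut values = vec![0i32; batch_size];
            let mut def_levels = vec![0i16; batch_size];
            // read_batch returns (values_read, levels_read)
            let (values_read, levels_read) =
                r.read_batch(batch_size, Some(&mut def_levels), None, &mut values)?;
            let mut v = 0;
            for i in 0..levels_read {
                if def_levels[i] == 1 {
                    // level 1 => value present at this slot
                    out.push(Some(values[v]));
                    v += 1;
                } else {
                    // level 0 => null at this slot
                    out.push(None);
                }
            }
            debug_assert_eq!(v, values_read);
        }
        Ok(out)
    }

Repeated fields add repetition levels on top of this, which is where the mapping to Arrow list arrays gets harder.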

andygrove commented 6 years ago

I have merged the current Parquet support to master in DataFusion, and I have added examples for both the DataFrame and SQL APIs.

https://github.com/datafusion-rs/datafusion-rs/tree/master/examples

This is very rough code and I won't be promoting the fact that Parquet support is there until this is a little more complete and tested.

liurenjie1024 commented 5 years ago

@andygrove @sunchao Are you guys working on this? I'm working on an implementation that uses the C++ version as a reference.

sunchao commented 5 years ago

@liurenjie1024 are you working on the arrow part or the parquet-rs part? Some more work in the arrow repo needs to be done so that parquet-rs can read into the Arrow format. I'm working (slowly) on that part.

liurenjie1024 commented 5 years ago

I'm working on the arrow part. Yes, only part of the C++ version can be implemented. I'll work out an early version.

nevi-me commented 5 years ago

This is unrelated, but I've seen that CSV readers are being implemented in Arrow (I think in Python and C++, and there's an open PR for a Go one).

BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV to arrow reader, which I think would also extend nicely to parquet.

@sunchao @andygrove, do you know if there's been discussion of something like this? Also, would some codegen be required? It's something I'd love to try to contribute to in the coming weeks.

sunchao commented 5 years ago

> BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV to arrow reader, which I think would also extend nicely to parquet.

Yes, it is certainly doable - it needs to be implemented in the arrow repo though. Some pieces may still be missing and you're welcome to work on the arrow repo :)

Also, I'm not sure how this extends to parquet. Can you elaborate?

nevi-me commented 5 years ago

Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it on the mailing list before opening a JIRA (didn't see one).

The extension to parquet was more conceptual than anything, in the sense that if I can read a CSV into Arrow, I'd be able to get CSV -> Parquet working. I understand that CSV is a "simpler" format due to its flat nature.
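
To make the first half of that concrete, here is a rough sketch of the CSV-to-columnar step using BurntSushi's rust-csv (read_f32_column is hypothetical glue, and a plain Vec<f32> stands in for the contiguous buffer an Arrow primitive array is built from):

    use csv::ReaderBuilder;
    use std::error::Error;

    // Hypothetical: parse one f32 column out of a CSV file into a contiguous Vec.
    fn read_f32_column(path: &str, col: usize) -> Result<Vec<f32>, Box<dyn Error>> {
        let mut rdr = ReaderBuilder::new().has_headers(true).from_path(path)?;
        let mut out = Vec::new();
        for record in rdr.records() {
            let record = record?;
            out.push(record.get(col).ok_or("missing column")?.parse::<f32>()?);
        }
        Ok(out)
    }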

Also, I haven't checked to see if datafusion-rs supports CSV and Parquet as sinks, so one would be able to write create table my_table<parquet> as select * from input<csv>. A lot of my curiosity comes from having to watch paint dry at work while I use Spark, so I've been thinking about what I'd need to replace some of my workflow with Rust.

sunchao commented 5 years ago

> Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it on the mailing list before opening a JIRA (didn't see one).

Yes, feel free to create a JIRA :)

> The extension to parquet was more conceptual than anything, in the sense that if I can read a CSV into Arrow, I'd be able to get CSV -> Parquet working. I understand that CSV is a "simpler" format due to its flat nature.

I see. Yes, I think we have discussed creating a CSV -> Parquet converter multiple times (cc @sadikovi). It would be convenient; we should certainly do it.

liurenjie1024 commented 5 years ago

I'm working on an arrow reader implementation and have finished the first step, converting the parquet schema to an arrow schema, in this PR. Please help review it.
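
At its core that conversion is a mapping from parquet physical types to arrow types, roughly along these lines (a simplified, hypothetical sketch; the actual PR also has to handle logical types, nested groups, and nullability):

    use arrow::datatypes::DataType;
    use parquet::basic::Type as PhysicalType;

    // Simplified sketch: map a parquet physical type to an arrow DataType.
    fn to_arrow_type(t: PhysicalType) -> Option<DataType> {
        match t {
            PhysicalType::BOOLEAN => Some(DataType::Boolean),
            PhysicalType::INT32 => Some(DataType::Int32),
            PhysicalType::INT64 => Some(DataType::Int64),
            PhysicalType::FLOAT => Some(DataType::Float32),
            PhysicalType::DOUBLE => Some(DataType::Float64),
            // assuming a UTF8 logical annotation on the byte array
            PhysicalType::BYTE_ARRAY => Some(DataType::Utf8),
            // INT96 and FIXED_LEN_BYTE_ARRAY need logical-type context
            _ => None,
        }
    }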

sunchao commented 5 years ago

The umbrella task is #186. Please watch that issue for progress on this matter.