jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Make Parquet read sync and async apis consistent #669

Open mdrach opened 2 years ago

mdrach commented 2 years ago

In v0.7.0 I could stream in pages of a Parquet column chunk in an async context, then move the data into a dedicated thread pool to perform the CPU-intensive work.

        let mut reader = RangedHttpStreamer::new(http_client, url, shard_size);
        let stream = get_page_stream(&column_chunk_metadata, &mut reader, None, vec![])
            .await
            .map_err(Error::internal)?;
        let pages = stream.collect::<Vec<_>>().await;

        let array: Result<Box<dyn arrow2::array::Array>> = spawn_blocking(move || {
            let mut basic_decompressor = BasicDecompressor::new(pages.into_iter(), vec![]);
            page_iter_to_array(
                &mut basic_decompressor,
                &column_chunk_metadata,
                field.data_type.clone(),
            )
            .map_err(Error::internal)
        })
        .await;

However, as of v0.8.0, page_iter_to_array has been replaced by column_iter_to_array, while the async API does not expose a corresponding get_column_stream (only get_page_stream). Is there a better way to load and parse a Parquet file from S3? Or are the APIs just out of sync?
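For readers without the surrounding crate at hand, the hand-off pattern in the snippet above (collect the pages during I/O, then move the owned pages to another thread for decoding) can be sketched with the standard library only. This is a minimal sketch under stated assumptions: RawPage, fetch_pages, decode and load_column are hypothetical stand-ins, not arrow2 or parquet2 items, and std::thread::spawn is swapped in for the async runtime's spawn_blocking.

```rust
use std::thread;

// Hypothetical stand-in for a compressed Parquet page; not an arrow2/parquet2 type.
#[derive(Debug, Clone, PartialEq)]
struct RawPage {
    bytes: Vec<u8>,
}

// Stand-in for the I/O stage. In the snippet above this is the awaited page stream.
fn fetch_pages() -> Vec<RawPage> {
    vec![
        RawPage { bytes: vec![1, 2, 3] },
        RawPage { bytes: vec![4, 5] },
    ]
}

// Stand-in for the CPU-heavy stage (decompression plus decoding).
// Here every byte is simply widened to an i64.
fn decode(pages: Vec<RawPage>) -> Vec<i64> {
    pages
        .into_iter()
        .flat_map(|page| page.bytes.into_iter().map(i64::from))
        .collect()
}

// Fetch all pages first, then move them into a worker thread for decoding.
// `move` transfers ownership of `pages`, so the worker never borrows from the I/O side.
fn load_column() -> Vec<i64> {
    let pages = fetch_pages();
    thread::spawn(move || decode(pages))
        .join()
        .expect("decoding thread panicked")
}

fn main() {
    let values = load_column();
    println!("decoded {} values", values.len());
}
```

The point of the design is that the pages are fully owned before the hand-off, so the decoding stage needs no access to the reader or the network client.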

jorgecarleitao commented 2 years ago

The APIs are out of sync.

Note that the reason for column_iter is that it allows for nested Parquet types. An alternative is to offer a page stream per Parquet column and have users assemble the columns themselves into the corresponding Arrow type, but I think that would require us to expose a larger (currently private) API and write more documentation.
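To illustrate why a nested type needs more than one page iterator, here is a minimal sketch under heavy simplification. A struct field is stored in Parquet as one leaf column per primitive child, so building the Arrow array for that field requires the pages of every leaf. Page, Array, leaf_to_array and field_to_array are hypothetical names for illustration, not arrow2 or parquet2 APIs, and the sketch ignores definition and repetition levels, which real nested decoding depends on.

```rust
// Hypothetical, simplified types; none of these are arrow2 or parquet2 APIs.
// A "page" here is just the already-decoded values it contains.
type Page = Vec<i32>;

#[derive(Debug, PartialEq)]
enum Array {
    Int32(Vec<i32>),
    Struct(Vec<Array>),
}

// One leaf column: concatenate the values of all its pages.
fn leaf_to_array<I: Iterator<Item = Page>>(pages: I) -> Array {
    Array::Int32(pages.flatten().collect())
}

// One field: a primitive field has a single leaf column, while a struct
// field has one leaf column per child, each with its own page iterator.
fn field_to_array<I: Iterator<Item = Page>>(leaves: Vec<I>) -> Array {
    let mut arrays: Vec<Array> = leaves
        .into_iter()
        .map(|pages| leaf_to_array(pages))
        .collect();
    if arrays.len() == 1 {
        arrays.remove(0)
    } else {
        Array::Struct(arrays)
    }
}

fn main() {
    // A struct field with children `x` and `y`, each stored as its own leaf column.
    let x = vec![vec![1, 2], vec![3]].into_iter();
    let y = vec![vec![10], vec![20, 30]].into_iter();
    println!("{:?}", field_to_array(vec![x, y]));
}
```

With only a single page stream per call, the caller would have to do the work of field_to_array themselves, which is the larger API surface mentioned above.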

jorgecarleitao commented 2 years ago

Would you like to tackle this one, or do you think I should prioritize it?

mdrach commented 2 years ago

If you could prioritize it, that would be great. I may be able to get to this, but likely not in the short term.

jorgecarleitao commented 2 years ago

I have started working on this. The first change is in parquet2, since that is where we declare these APIs.