jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Add support to read from Apache ORC #759

Closed · jorgecarleitao closed this 2 years ago

jorgecarleitao commented 2 years ago

The core development for this is being carried out here: https://github.com/jorgecarleitao/orc-rs. The hope is that once we can read stripes there, we can plug that here and deserialize to arrow, just like we do for parquet.

PRs over there are of course very welcome :bow:

Igosuki commented 2 years ago

AFAIK ORC is in some ways a better Parquet, born out of frustration with using Parquet on object storage. ORC has indexing, for instance, which makes it much easier to distribute chunks to partitions for distributed computing. It would be interesting to surface ORC's advantageous features in the higher-level API as something that dependent libraries (e.g. datafusion) could then use. Let me know what you think!

iajoiner commented 2 years ago

@jorgecarleitao Actually Iā€™m also working on an ORC reader in Rust. I plan to add the writer as well.

See https://issues.apache.org/jira/projects/ORC/issues/ORC-1180 https://issues.apache.org/jira/projects/ORC/issues/ORC-1181

jorgecarleitao commented 2 years ago

Hey @iajoiner, that is awesome to know! I have been working on this in the past, and I finally got the time and mind space (vacations!) to publish it.

The implementation is available at https://github.com/DataEngineeringLabs/orc-format (https://crates.io/crates/orc-format) and contains the bare bones needed to read ORC. I added integration tests against pyorc (which wraps the official C++ implementation) for the things that work.

I wrote it to be as performant as I could. The only sub-performant piece is bit-unpacking: afaik there is no performant Rust implementation for u64 (for u32 there is the bitpacking crate), so I just implemented a simple (non-performant) version that passes the tests.
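For readers unfamiliar with the problem: bit-unpacking reads fixed-width integers that are packed back-to-back with no byte alignment. A scalar (non-SIMD) u64 unpacker of the kind described, correct but not optimized, might look like the sketch below. The function name, MSB-first bit order, and signature are illustrative assumptions, not the crate's actual API:

```rust
/// Unpack `count` integers of `width` bits (1..=64) from `bytes`,
/// MSB-first within each byte, into u64 values.
/// Illustrative sketch only; not the orc-format implementation.
fn unpack_u64(bytes: &[u8], width: usize, count: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize; // absolute bit offset into `bytes`
    for _ in 0..count {
        let mut value: u64 = 0;
        for _ in 0..width {
            let byte = bytes[bit_pos / 8];
            // Extract one bit, MSB first, and shift it into the value.
            let bit = (byte >> (7 - (bit_pos % 8))) & 1;
            value = (value << 1) | bit as u64;
            bit_pos += 1;
        }
        out.push(value);
    }
    out
}

fn main() {
    // [0xAF, 0x00] holds three 3-bit values packed MSB-first: 101 011 110.
    let values = unpack_u64(&[0xAF, 0x00], 3, 3);
    assert_eq!(values, vec![5, 3, 6]);
    println!("{:?}", values);
}
```

A performant version would process whole words at a time instead of one bit per iteration, which is exactly where a u64 equivalent of the bitpacking crate would help.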

There are of course a lot of things missing from the spec. If you want, we can pair up and work on it. I think the main difference is that it does not use build.rs. The reason is that I really like having the generated code easily available via the IDE's "click on struct/function", something that build.rs takes away (since the code is embedded via an include clause).

Note that, as I am doing with parquet2 and avro-schema, I do not declare an in-memory format in the crate and instead provide a toolkit (e.g. iterators, generics) to decompress, decode, and deserialize from ORC (and use them in integration tests, where I use an in-memory format for testing purposes).
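A minimal sketch of that toolkit style (the names here are hypothetical, not orc-format's actual API): the decoder is exposed as a lazy iterator, so the consuming crate, not the format crate, decides which in-memory representation to collect into:

```rust
// Hypothetical run-length decoder illustrating the "toolkit" design:
// yield values lazily instead of materializing a container.
// `runs` is a slice of (run length, value) pairs.
fn rle_decode(runs: &[(u32, i64)]) -> impl Iterator<Item = i64> + '_ {
    runs.iter()
        .flat_map(|&(len, value)| std::iter::repeat(value).take(len as usize))
}

fn main() {
    // The caller picks the container: a Vec here, but it could just as
    // well feed an Arrow array builder without an intermediate copy.
    let values: Vec<i64> = rle_decode(&[(3, 7), (2, -1)]).collect();
    assert_eq!(values, vec![7, 7, 7, -1, -1]);
    println!("{:?}", values);
}
```

The benefit of this design is that arrow2 (or any other consumer) can drive the same decoding logic while deserializing directly into its own memory layout.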

I am planning to start integrating that dependency into this project so that we can read into Arrow. This will offer important input about the API (e.g. whether we need bridge structs for the proto-generated types, to help users) and further testing.

Let me know your thoughts (here or preferably on https://github.com/DataEngineeringLabs/orc-format).

jorgecarleitao commented 2 years ago

This has been closed by #1189 šŸŽ‰šŸŽ‰šŸŽ‰