jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

arrow<->arrow2 interoperability conversion method? #629

Open dbr opened 2 years ago

dbr commented 2 years ago

I'm using arrow2 in a project along with polars. I now also want to send the same data to datafusion, which uses arrow (ideally without having to send the data through the IPC serialization format).

Before I fumble around with the FFI API, I figured I would check first: is there a method somewhere which handles the conversion of a RecordBatch from arrow to arrow2 and vice versa? Seems like something that might exist already in some kind of integration test or similar, but "arrow<->arrow2" is quite a tricky thing to search for.

jorgecarleitao commented 2 years ago

Nope, we do not have such a method. That is because doing so requires arrow depending on arrow2 or vice-versa.

The Arrow format was designed exactly to support this case, though, and the FFI is really the conversion. Note that a RecordBatch is not part of the Arrow in-memory specification; it is just a struct that exists in some implementations.
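To make the "FFI is really the conversion" point a bit more concrete, here is a minimal std-only sketch of what crossing that boundary looks like for a non-nullable Int32 array. The struct layout follows `struct ArrowArray` from the Arrow C Data Interface; `Exported`, `export_i32`, and `import_i32` are hypothetical names made up for this illustration and are not part of arrow or arrow2. Both crates have their own `ffi` modules, which also handle the matching `ArrowSchema`, nested types, and validity bitmaps that this sketch ignores.

```rust
use std::ffi::c_void;
use std::ptr;

/// Mirrors `struct ArrowArray` from the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Hypothetical owner of the exported memory; freed by the release callback.
struct Exported {
    values: Vec<i32>,
    // Primitive layout: [validity bitmap, data]. Validity may be null
    // when there are no nulls.
    buffers: [*const c_void; 2],
}

unsafe extern "C" fn release_exported(array: *mut ArrowArray) {
    unsafe {
        let array = &mut *array;
        // Reclaim and drop the producer-side owner.
        drop(Box::from_raw(array.private_data as *mut Exported));
        // The spec marks a released struct by clearing the callback.
        array.release = None;
    }
}

/// "Producer" side: hand out the values without copying them.
pub fn export_i32(values: Vec<i32>) -> ArrowArray {
    let length = values.len() as i64;
    let owner = Box::into_raw(Box::new(Exported {
        values,
        buffers: [ptr::null(); 2],
    }));
    // SAFETY: `owner` comes from a live Box and nothing else aliases it yet.
    let exported = unsafe { &mut *owner };
    exported.buffers[1] = exported.values.as_ptr() as *const c_void;
    ArrowArray {
        length,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers: exported.buffers.as_mut_ptr(),
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_exported),
        private_data: owner as *mut c_void,
    }
}

/// "Consumer" side: read the values through the C struct, then release it.
/// The copy into a Vec is only there to keep the demo simple.
pub fn import_i32(mut array: ArrowArray) -> Vec<i32> {
    let out = unsafe {
        let data = *array.buffers.add(1) as *const i32;
        let start = data.add(array.offset as usize);
        std::slice::from_raw_parts(start, array.length as usize).to_vec()
    };
    if let Some(release) = array.release {
        unsafe { release(&mut array) };
    }
    out
}
```

Calling `import_i32(export_i32(vec![1, 2, 3]))` round-trips the values. A real conversion would keep the imported buffers alive behind the release callback instead of copying them.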

ritchie46 commented 2 years ago

There could be a crate one level higher that implements the conversion.

    arrow-conv
         |
        /\
      /    \
arrow      arrow2 

It could maybe support conversion of the core types, ArrayRef and RecordBatch.

houqp commented 2 years ago

Off topic, but in case you missed it, there is also a fairly up-to-date arrow2 branch for datafusion that's being worked on.

renato2099 commented 2 years ago

Hi @houqp, may I ask which branch that is? And is it going to be under Apache, or part of a separate repo?

alamb commented 2 years ago

I think it means https://github.com/apache/arrow-datafusion/pull/68

multimeric commented 2 years ago

> doing so requires arrow depending on arrow2 or vice-versa.

This could easily be feature-gated though, so there is no hard dependency. The FFI works but is very awkward to do manually since it involves working with raw pointers, versus the lovely abstraction of just calling .into().

dbr commented 2 years ago

> This could easily be feature-gated though, so there is no hard dependency

True, although I think there are some benefits to having it be a separate crate, mainly that its releases wouldn't be tied to either project's release cycle (i.e. a new release of the conv crate could be made whenever arrow or arrow2 is released).

It could also allow the crate to work with multiple versions of either crate via feature flags (similar to how multiple winit versions are handled here).

The crate could also support the pyarrow interop via PyO3, which would have the same benefits (e.g. currently arrow-rs 9 only supports pyo3 0.15, and we are blocked on updating pyo3 until the next release of arrow).

The biggest drawback of it being a separate crate would be that you can't as neatly implement some conversion traits (since they would be foreign types).
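For anyone unfamiliar with the foreign-types problem, here is a small sketch of it and of one common workaround. Everything in it is a stand-in: `arrow_stub` and `arrow2_stub` are hypothetical modules with simplified `Field` structs, and `IntoArrow2` is a made-up trait name, not something either crate provides.

```rust
// Stand-ins for the two upstream crates. In a real conversion crate these
// types would come from `arrow` and `arrow2`, so both would be foreign.
pub mod arrow_stub {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Field {
        pub name: String,
        pub nullable: bool,
    }
}

pub mod arrow2_stub {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Field {
        pub name: String,
        pub is_nullable: bool,
    }
}

// A separate crate could not write
//     impl From<arrow::datatypes::Field> for arrow2::datatypes::Field { ... }
// because neither the trait nor either type is local to it (the orphan rule),
// so a plain `.into()` is off the table.
//
// A trait defined by the conversion crate itself is allowed, at the cost of
// callers having to import it.
pub trait IntoArrow2 {
    type Output;
    fn into_arrow2(self) -> Self::Output;
}

impl IntoArrow2 for arrow_stub::Field {
    type Output = arrow2_stub::Field;

    fn into_arrow2(self) -> Self::Output {
        arrow2_stub::Field {
            name: self.name,
            is_nullable: self.nullable,
        }
    }
}
```

With the trait in scope, `field.into_arrow2()` reads almost as nicely as `.into()`. Free functions would work equally well and avoid the extra import.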

alamb commented 2 years ago

I think a conversion crate would be valuable indeed, though I don't have time to work on such a thing now.

We could potentially host it in the https://github.com/datafusion-contrib organization and there might be others in the community who are interested in helping -- see the arrow2 milestone https://github.com/apache/arrow-datafusion/milestone/3

jorgecarleitao commented 2 years ago

Designing a bit, I think that we would need 6 functions:

pub fn arrow_to_arrow2_error(error: arrow::error::ArrowError) -> arrow2::error::ArrowError;
pub fn arrow_to_arrow2_field(field: arrow::datatypes::Field) -> Result<arrow2::datatypes::Field, arrow2::error::ArrowError>;
pub fn arrow_to_arrow2_array(array: Arc<dyn arrow::array::Array>) -> Result<Arc<dyn arrow2::array::Array>, arrow2::error::ArrowError>;

pub fn arrow2_to_arrow_error(error: arrow2::error::ArrowError) -> arrow::error::ArrowError;
pub fn arrow2_to_arrow_field(field: arrow2::datatypes::Field) -> Result<arrow::datatypes::Field, arrow::error::ArrowError>;
pub fn arrow2_to_arrow_array(array: Arc<dyn arrow2::array::Array>) -> Result<Arc<dyn arrow::array::Array>, arrow::error::ArrowError>;
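As a rough idea of what the simplest of these could look like, here is a sketch of the error direction only. The two enums are heavily reduced, hypothetical stand-ins (the real error types in both crates have more variants, and the variant names here are illustrative), so treat the mapping as an assumption rather than a proposal.

```rust
use std::fmt;

// Hypothetical, reduced stand-in for `arrow::error::ArrowError`.
pub mod arrow_stub {
    #[derive(Debug)]
    pub enum ArrowError {
        NotYetImplemented(String),
        ComputeError(String),
    }
}

// Hypothetical, reduced stand-in for `arrow2::error::ArrowError`.
pub mod arrow2_stub {
    #[derive(Debug, PartialEq)]
    pub enum ArrowError {
        NotYetImplemented(String),
        External(String),
    }
}

impl fmt::Display for arrow_stub::ArrowError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::NotYetImplemented(msg) => write!(f, "Not yet implemented: {}", msg),
            Self::ComputeError(msg) => write!(f, "Compute error: {}", msg),
        }
    }
}

/// Maps variants that have a direct counterpart one-to-one and falls back
/// to a catch-all variant carrying the formatted message for the rest.
pub fn arrow_to_arrow2_error(error: arrow_stub::ArrowError) -> arrow2_stub::ArrowError {
    use arrow2_stub::ArrowError as E2;
    use arrow_stub::ArrowError as E1;

    match error {
        E1::NotYetImplemented(msg) => E2::NotYetImplemented(msg),
        other => E2::External(other.to_string()),
    }
}
```

The reverse direction would follow the same pattern. The field and array functions are where the FFI does the real work.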

ritchie46 commented 2 years ago

You miss out on a valid naming option: arrow2arrow 😜

alamb commented 2 years ago

> You miss out on a valid naming option: arrow2arrow 😜

🤯 lol