Open dbr opened 2 years ago
No, we do not have such a method; adding one would require arrow depending on arrow2 or vice versa.

The Arrow format was designed exactly to support this case, though, and the FFI is essentially that conversion. The catch is that a RecordBatch is not part of the Arrow in-memory specification; it is just a struct that exists in some implementations.

There could be a crate one level higher that implements the conversion:
```
    arrow-conv
        |
       / \
  arrow     arrow2
```
It could maybe support conversion of the core types, ArrayRef and RecordBatch.
Off topic, but in case you missed it: there is also a fairly up-to-date arrow2 branch for datafusion that's being worked on.
Hi @houqp, may I ask which branch that is? And is it going to be under apache, or part of a separate repo?
I think it means https://github.com/apache/arrow-datafusion/pull/68
> doing so requires arrow depending on arrow2 or vice-versa.

This could easily be feature-gated though, so there is no hard dependency. The FFI works but is very awkward to do manually, since it involves working with raw pointers, versus the lovely abstraction of just calling .into().
> This could easily be feature-gated though, so there is no hard dependency

True, although I think there are some benefits to having it be a separate crate, mainly that its releases wouldn't be tied to either project's release cycle (i.e. a new release of the conversion crate could be made whenever arrow or arrow2 is released).
It could also allow the crate to work with multiple versions of either crate via feature flags (similar to how multiple winit versions are handled here).
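For illustration, the multi-version scheme could use Cargo's dependency renaming, roughly like this (the crate name `arrow-conv`, the feature names, and the version numbers here are all hypothetical):

```toml
[features]
arrow-8 = ["arrow_8"]
arrow-9 = ["arrow_9"]

[dependencies]
# Rename each major version so both can be optional dependencies at once.
arrow_8 = { package = "arrow", version = "8", optional = true }
arrow_9 = { package = "arrow", version = "9", optional = true }
```

Downstream users would then enable only the feature matching the arrow version they already depend on.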
The crate could also support the pyarrow interop via PyO3, which would have the same benefits (e.g. currently arrow-rs 9 only supports pyo3 0.15, and we are blocked on updating pyo3 until the next release of arrow).
The biggest drawback of it being a separate crate is that you can't as neatly implement some conversion traits (since both sides would be foreign types).
I think a conversion crate would be valuable indeed, though I don't have time to work on such a thing now.
We could potentially host it in the https://github.com/datafusion-contrib organization, and there might be others in the community who are interested in helping -- see the arrow2 milestone: https://github.com/apache/arrow-datafusion/milestone/3
Designing a bit, I think we would need six functions:

```rust
pub fn arrow_to_arrow2_error(error: arrow::error::ArrowError) -> arrow2::error::ArrowError;
pub fn arrow_to_arrow2_field(field: arrow::datatypes::Field) -> Result<arrow2::datatypes::Field, arrow2::error::ArrowError>;
pub fn arrow_to_arrow2_array(array: Arc<dyn arrow::array::Array>) -> Result<Arc<dyn arrow2::array::Array>, arrow2::error::ArrowError>;
pub fn arrow2_to_arrow_error(error: arrow2::error::ArrowError) -> arrow::error::ArrowError;
pub fn arrow2_to_arrow_field(field: arrow2::datatypes::Field) -> Result<arrow::datatypes::Field, arrow::error::ArrowError>;
pub fn arrow2_to_arrow_array(array: Arc<dyn arrow2::array::Array>) -> Result<Arc<dyn arrow::array::Array>, arrow::error::ArrowError>;
```
You miss out on a valid naming option: arrow2arrow :stuck_out_tongue_winking_eye:

> You miss out on a valid naming option: arrow2arrow 😜
🤯 lol
I'm using arrow2 in a project along with polars. I now also want to send the same data to datafusion, which uses arrow (ideally without having to send the data through the IPC serialization format). Before I fumble around with the FFI API, I figured I would check first: is there a method somewhere which handles the conversion of a RecordBatch from arrow to arrow2 and vice versa? Seems like something that might exist already in some kind of integration test or similar, but "arrow<->arrow2" is quite a tricky thing to search for.