jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 223 forks source link

[question] ownership and memory when using ffi #1455

Closed SimonSchneider closed 1 year ago

SimonSchneider commented 1 year ago

I have some potentially naive questions, if this is the wrong place feel free to close this and time permitting point me at a better place to ask.

I'm using FFI to move data from arrow2 to arrow-rs. It all works but I'm confused about how the memory and ownership works. I'm using arrow2 through polars and I've been able to get the below methods to do what I want, which is to give me arrow-rs ArrayRefs which I can make into record batches and continue from there.

use polars::export::arrow::array::Array as Array2;
// more use (any type suffixed by 2 is from arrow2 and any that is not is arrow-rs

fn arrow2_to_arrow_array(data: Box<dyn Array2>, data_type: &DataType) -> Result<ArrayRef> {
    let schema = FFI_ArrowSchema::try_from(data_type)?;
    let data_ffi = ffi2::export_array_to_c(data);
    let data: FFI_ArrowArray = unsafe { transmute(data_ffi) };
    Ok(make_array(ArrowArray::new(data, schema).try_into()?))
}

fn series_to_arrow(series: &Series) -> Result<(Field, Vec<ArrayRef>)> {
    let field = arrow2_to_arrow_field(&series.field().to_arrow())?;
    let array_data = series
        .chunks()
        .iter()
        .map(|s| arrow2_to_arrow_array(s.to_boxed(), field.data_type()))
        .collect::<Result<Vec<_>>>()?;
    Ok((field, array_data))
}

What I can't understand is how this works memory wise. In the series_to_arrow I provide a &Series which to me indicates I'm not taking ownership of the underlying data. But then in arrow2_to_arrow_array I'm able to use ffi to transmute and "give" the data to arrow-rs.

My intuition has a couple of different scenarios:

  1. somewhere the memory is copied and I have 2 copies of the data
  2. the memory is moved to arrow-rs, but that seems impossible because I only provide a reference to the Series, so I would not expect this to be able to take ownership of the series data.
  3. the data is somehow shared between arrow2 and arrow-rs, but in that case who is the owner. There's no Arc or anything from what I could find, so who cleans it up?

I guess it shows but I'm quite confused here and I'm unsure if this is even safe to do or if I'm wielding a footgun. However using the arrow memory layout is supposed to enable zero copy sharing so I'm guessing there has to be some way of achieving this.

Perhaps you could shed some light on what is actually happening?

SimonSchneider commented 1 year ago

I've done some testing now allocating a huge polars Dataframe and moving it to arrow-rs and looking at the memory:

let df = create_huge_dataframe();

sleep_and_check_memory("created df") //<---- memory at about 750Mb

let arrow_rs_data = df_to_record_batches(&df);

sleep_and_check_memory("created arrow_rs"); //<---- memory at about 750Mb

run_some_query_on_df_and_drop_it(df);

sleep_and_check_memory("ran some query on the DF and dropped it"); //<---- memory at about 750Mb

run_some_query_on_arrow_rs_data_and_drop_it(arrow_rs_data); 

sleep_and_check_memory("ran some query on the arrow_rs_data and dropped it"); //<---- memory at about ~20Mb

this was somewhat enlightening but I don't think I've grokked it fully yet, it looks like arrow2 and arrow-rs are using the same memory (indicated by the fact that memory does not double when creating the arrow-rs data. However, they are somehow fine sharing the memory and cleaning it up after none of them are referencing it. How that works I'm not sure 🤷

ritchie46 commented 1 year ago

This is a key feature of arrow! Any arrow implementation within same process can use the arrow C interface to share pointers within the same process: https://arrow.apache.org/docs/format/CDataInterface.html

This allows us to share memory of any implementations, arrow2, arrow-rs, pyarrow, arrow C++ etc.

The implementation remains responsible for destruction of its own memory. It sends a function over the FFI and the other implementation will call that function if the memory needs to be deleted.

SimonSchneider commented 1 year ago

Thank you ritchie46. That clears it up I think, I was aware that the memory looking a certain way was important for sharing but had never really thought about or understood how freeing is handled. To reiterate then, in my case then the arrow2 implementation actually keeps track of the fact that the memory is also referenced by the arrow-rs, probably via some Arc. When arrow-rs is done it'll call the "release callback" to say it is done with it and arrow2 can clean up the memory when everyone is done.