rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Introduce a new owning type for Arrow interop data #16104

Open vyasr opened 4 days ago

vyasr commented 4 days ago

Is your feature request related to a problem? Please describe.

Currently libcudf's primary data type is the column, which is a wrapper around a set of rmm::device_buffer objects that own its memory. The column has sole ownership of the data via the underlying unique_ptr semantics. From an algorithmic perspective this is fine because libcudf operates entirely on views of data. While column provides convenient conversions to column_view, users of libcudf could just as easily create the views from any other data source because column_view objects may be constructed from an arbitrary set of data pointers (this is leveraged by cuDF Python, for instance, which handles ownership differently from libcudf and therefore extracts data from column objects as soon as any libcudf algorithm returns one).

However, the ownership semantics of column are not flexible enough to accommodate the ingestion of Arrow device data via the C Data Interface. Historically, libcudf has only supported consuming host Arrow data, which intrinsically requires a copy, so creating a column is fine. With the work done on https://github.com/rapidsai/cudf/issues/14926, though, libcudf now supports conversion from device Arrow data. Since the purpose of the Arrow interface is to support zero-copy sharing of data across library boundaries, it supports the hand-off of data that we may need to keep alive for an indeterminate amount of time but that we should eventually release. Critically, in this scenario "releasing" does not mean freeing the data. Rather, releasing is a process defined by the producer of the data that may free it, or may simply perform some other bookkeeping to allow it to be freed later (e.g. if there are shared memory semantics involved). There is no good way to represent these semantics with column right now, resulting in the more complex considerations laid out in this comment.
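To make the distinction between releasing and freeing concrete, here is a minimal sketch of a producer whose release callback only does bookkeeping until the last user is gone. The ArrowArray struct is reproduced from the Arrow C Data Interface specification; shared_block, release_shared, export_shared, and demo_release_is_not_free are hypothetical names invented for illustration and are not part of libcudf or Arrow. Host memory stands in for device memory so that the sketch runs anywhere.

```cpp
#include <cassert>
#include <cstdint>

// ArrowArray as defined by the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical producer-side state: one allocation shared by several exports.
struct shared_block {
  int32_t* data;
  int use_count;
  bool freed;
};

// Hypothetical release callback: bookkeeping first, freeing only for the last user.
void release_shared(ArrowArray* array) {
  assert(array->release != nullptr);  // releasing twice would be a consumer bug
  auto* block = static_cast<shared_block*>(array->private_data);
  if (--block->use_count == 0) {
    delete[] block->data;
    block->data  = nullptr;
    block->freed = true;
  }
  delete[] array->buffers;
  // The specification requires the callback to mark the array as released.
  array->release = nullptr;
}

// Hypothetical export helper: fills a consumer-allocated ArrowArray.
void export_shared(shared_block* block, int64_t length, ArrowArray* out) {
  ++block->use_count;
  *out               = ArrowArray{};
  out->length        = length;
  out->n_buffers     = 2;
  out->buffers       = new const void*[2]{nullptr, block->data};  // validity, data
  out->release       = release_shared;
  out->private_data  = block;
}

// Returns true if the data survived the first release and was freed by the second.
bool demo_release_is_not_free() {
  shared_block block{new int32_t[4]{1, 2, 3, 4}, 0, false};
  ArrowArray a{}, b{};
  export_shared(&block, 4, &a);
  export_shared(&block, 4, &b);

  a.release(&a);
  bool alive_after_first = !block.freed && a.release == nullptr;
  b.release(&b);
  return alive_after_first && block.freed;
}
```

From the consumer's side both calls look identical: it only ever invokes release and never learns whether memory was actually freed. An owning wrapper would need to preserve exactly this behavior.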

Describe the solution you'd like

We should define a new type cudf::arrow_column that can faithfully represent the Arrow interface's memory semantics. This type could be used for both host and device Arrow data (i.e. by both from_arrow_host and from_arrow_device) and would be responsible for storing an Arrow[Device]Array and then calling its release pointer upon destruction. If necessary, this type could itself expose a mechanism by which its ArrowArray could be exported, i.e. it could become a producer for the C Data Interface. To do so, it would need to wrap its own ArrowArray in an internal object with something like reference counting semantics so that the original producer's release callback would not be invoked until all re-exported arrays were also destroyed. This approach would allow us to unify all of the existing APIs into a simpler set of function overloads with well-defined memory semantics that the caller no longer has to be aware of.
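One possible shape for the reference counting described here, as a sketch rather than a proposal for the final design: the imported array lives in a holder managed by std::shared_ptr, and every re-exported ArrowArray stores its own shared_ptr copy in private_data. The names imported_array, export_view, release_export, producer_release, and demo_reexport are hypothetical, the ArrowArray struct is copied from the C Data Interface specification, and only a flat array without children or a dictionary is handled.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// ArrowArray as defined by the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical holder: owns the imported array and releases it exactly once.
struct imported_array {
  ArrowArray arr{};
  imported_array()                      = default;
  imported_array(imported_array const&) = delete;
  ~imported_array() {
    if (arr.release != nullptr) { arr.release(&arr); }
  }
};

// Hypothetical release callback for re-exported arrays: drops one shared reference.
void release_export(ArrowArray* array) {
  delete static_cast<std::shared_ptr<imported_array>*>(array->private_data);
  array->release = nullptr;
}

// Hypothetical re-export: the output aliases the imported buffers and keeps the
// holder alive. Children and dictionaries would need extra handling.
void export_view(std::shared_ptr<imported_array> const& holder, ArrowArray* out) {
  *out              = holder->arr;  // shallow copy of lengths and buffer pointers
  out->release      = release_export;
  out->private_data = new std::shared_ptr<imported_array>(holder);
}

// Stand-in for an external producer; counts how often its callback runs.
int producer_release_calls = 0;
void producer_release(ArrowArray* array) {
  ++producer_release_calls;
  array->release = nullptr;
}

// Returns the number of times the original producer's callback was invoked.
int demo_reexport() {
  producer_release_calls = 0;
  ArrowArray from_producer{};
  from_producer.length  = 3;
  from_producer.release = producer_release;

  auto holder           = std::make_shared<imported_array>();
  holder->arr           = from_producer;  // take over the producer's array
  from_producer.release = nullptr;        // mark the source as moved-from

  ArrowArray exported{};
  export_view(holder, &exported);

  holder.reset();                       // the importing object goes away
  assert(producer_release_calls == 0);  // the export still keeps the data alive
  exported.release(&exported);          // last owner gone
  return producer_release_calls;
}
```

The producer's callback runs once, and only after both the importing object and the re-exported array are gone, which is the property the paragraph above asks for.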

Describe alternatives you've considered

We could alternatively try to find a way to support shared ownership at a deeper level by making it possible to construct a shared version of an rmm::device_buffer. This would require substantially more work, though, and might still require refactoring of cudf internals to use such an object. Simply making it possible to construct an rmm::device_buffer from a preexisting pointer in such a way that the buffer assumes ownership (analogous to the std::unique_ptr(pointer p) constructor) would not be sufficient, since what we need is a way for the buffer to not free the memory on destruction but to instead call the release callback. This seems out of scope for rmm and more like a cudf feature since it's specifically for Arrow interop.
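For illustration, this is roughly what such a buffer would have to look like: it adopts a pointer but, on destruction, notifies the producer instead of freeing anything. foreign_buffer and demo_foreign_buffer are hypothetical names; nothing like this exists in rmm, and host memory is used here so that the sketch is runnable without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical buffer that never frees its memory. On destruction it runs a
// producer-supplied callback, which may or may not free the allocation.
class foreign_buffer {
 public:
  foreign_buffer(void* data, std::size_t size, std::function<void()> on_release)
    : data_{data}, size_{size}, on_release_{std::move(on_release)} {}

  // Moving transfers the obligation to notify the producer.
  foreign_buffer(foreign_buffer&& other)
    : data_{std::exchange(other.data_, nullptr)},
      size_{std::exchange(other.size_, 0)},
      on_release_{std::exchange(other.on_release_, nullptr)} {}

  foreign_buffer(foreign_buffer const&)            = delete;
  foreign_buffer& operator=(foreign_buffer const&) = delete;

  ~foreign_buffer() {
    if (on_release_) { on_release_(); }
  }

  void* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  void* data_;
  std::size_t size_;
  std::function<void()> on_release_;
};

// Returns how many times the producer was notified.
int demo_foreign_buffer() {
  int storage[4]    = {1, 2, 3, 4};  // memory owned by some other library
  int release_calls = 0;
  {
    foreign_buffer buf{storage, sizeof(storage), [&release_calls] { ++release_calls; }};
    foreign_buffer moved{std::move(buf)};
    assert(moved.data() == storage);
    assert(buf.data() == nullptr);
  }
  // Two objects were destroyed, but only the current owner notified the producer.
  return release_calls;
}
```

The callback member is the part that does not fit a general-purpose allocator-backed buffer, which is consistent with treating this as a cudf-level concern.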

Additional context

The extended discussion leading to this issue may be found here.

vyasr commented 4 days ago

CC @kkraus14 @jorisvandenbossche @paleolimbot @jrhemstad @davidwendt

jrhemstad commented 4 days ago

A new type is fine, but I'm not clear on what you're envisioning for where/how this cudf::arrow_column would be used.

vyasr commented 4 days ago

The API I'm envisioning would be

std::unique_ptr<cudf::arrow_column> from_arrow(ArrowSchema const* schema, ArrowDeviceArray *input, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr);

The object would look roughly like this:


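class arrow_column {
public:
  arrow_column(ArrowDeviceArray *input) : arr{std::make_shared<ArrowDeviceArray>()} {
    // Copy the device metadata, then move the embedded ArrowArray so that
    // *input is marked as released and this object owns the release callback.
    arr->device_id   = input->device_id;
    arr->device_type = input->device_type;
    arr->sync_event  = input->sync_event;
    ArrowArrayMove(&input->array, &arr->array);
  }
  ~arrow_column() {
    // Invoke the producer's release callback instead of freeing the memory directly.
    if (arr->array.release != nullptr) { arr->array.release(&arr->array); }
  }
  column_view view();
  mutable_column_view mutable_view();
private:
  // Using a shared_ptr to potentially allow re-export in the future, but that would require extra machinery to get right.
  std::shared_ptr<ArrowDeviceArray> arr;
};

jrhemstad commented 4 days ago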

Got it, so the idea would be to just introduce this type as a new return type for from_arrow that preserves the shared ownership semantics. Other cudf APIs would be unaffected and we wouldn't expect to update other APIs to also try and preserve shared ownership semantics.

Makes sense to me.

davidwendt commented 4 days ago

This reminds me a bit of contiguous_split, where a struct is returned that contains device memory along with a view into that data, though the details do not quite match here. I'd like to think of this in terms of that, perhaps.

vyasr commented 4 days ago

Whoops also obviously CC @zeroshade

vyasr commented 4 days ago

Here is a more complete sketch of what I'm imagining. I haven't thought all the way through how the to_arrow side of things should look, but here's one proposal:

// Class to manage lifetime semantics and allow re-export.
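struct arrow_array_container {
  // Stored by value so the container always has valid storage to move into.
  ArrowDeviceArray arr{};
  // Question: When the input data was host data, we could presumably release
  // immediately. Do we care? If so, how should we implement that?
  ~arrow_array_container() {
    // Skip arrays that were never populated or have already been released.
    if (arr.array.release != nullptr) { arr.array.release(&arr.array); }
  }
};

class arrow_column {
public:
  arrow_column(ArrowDeviceArray *input) : container{std::make_shared<arrow_array_container>()} {
    // Copy the device metadata, then move the embedded ArrowArray so that
    // *input is marked as released and the container owns the release callback.
    container->arr.device_id   = input->device_id;
    container->arr.device_type = input->device_type;
    container->arr.sync_event  = input->sync_event;
    ArrowArrayMove(&input->array, &container->arr.array);
  }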
  cudf::column_view view();
  cudf::mutable_column_view mutable_view();

  // Create Array whose private_data contains a shared_ptr to this->container
  // The output should be consumer-allocated, see
  // https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
  // Note: May need stream/mr depending on where we put what logic.
  void to_arrow(ArrowDeviceArray *output);
private:
  // Using a shared_ptr allows re-export via to_arrow
  std::shared_ptr<arrow_array_container> container;
};

class arrow_table {
public:
  arrow_table(std::vector<std::shared_ptr<arrow_column>> columns) : columns{std::move(columns)} {}
  cudf::table_view view();
  cudf::mutable_table_view mutable_view();
  // Create Array whose private_data contains shared_ptrs to all the underlying arrow_array_containers
  void to_arrow(ArrowDeviceArray *output);
private:
  // Would allow arrow_columns being in multiple arrow_tables
  std::vector<std::shared_ptr<arrow_column>> columns;
};

// ArrowArrayStream and ArrowArray overloads (they can be overloads now instead
// of separate functions) are trivial wrappers around this function. Also need versions
// of all three that return an arrow_column instead of an arrow_table.
std::unique_ptr<arrow_table> from_arrow(
  ArrowSchema const* schema,
  ArrowDeviceArray *input,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

// Produce an ArrowDeviceArray and then create an arrow_table around it.
std::unique_ptr<arrow_table> to_arrow(
  // Question: Do we really need a column_view overload? If we're going this
  // route, I think it's OK to always require a transfer of ownership to the
  // arrow_table, but there is potentially some small overhead there.
  std::unique_ptr<cudf::table> input,
  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());