rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.07k stars 873 forks source link

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

Closed lithomas1 closed 1 week ago

lithomas1 commented 3 weeks ago

Describe the bug A clear and concise description of what the bug is.

plc.interop.to_arrow is segfaulting when passed (possibly invalid) metadata.

Steps/Code to reproduce bug Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

import cudf._lib.pylibcudf as plc
import pyarrow as pa

def metadata_from_arrow_type(
    pa_type: pa.Array,
    name: str = "",
) -> plc.interop.ColumnMetadata | None:
    metadata = plc.interop.ColumnMetadata(name) #None
    if pa.types.is_list(pa_type) or pa.types.is_struct(pa_type):
        child_meta = []
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(
            name,
            # libcudf does not store field names, so just match pyarrow's.
            child_meta
        )
    return metadata

pa_array = pa.array(
    [[[4]], [[5, 6]], [[7]], [[8]], [[9]], [[10]]], type=pa.list_(pa.list_(pa.int64()))
)

plc_array = plc.interop.from_arrow(pa_array)

print(metadata_from_arrow_type(pa_array.type))
new_pa_array = plc.interop.to_arrow(plc_array, metadata=metadata_from_arrow_type(pa_array.type))

This code will segfault or raise memoryerror.

Expected behavior A clear and concise description of what you expected to happen.

No segfault. If the metadata being passed is wrong, an error should be raised (like in the struct case).

Environment overview (please complete the following information)

Environment details Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context

The error seems to occur in the C++ arrow conversion code (but still could be a bug in either pylibcudf or libcudf), so I'm labelling as both a libcudf and pylibcudf issue.

wence- commented 3 weeks ago

This is because the to_arrow interop expects, if the column_metadata for the children of a ListColumn is not empty, that you provide metadata for both the offsets column and the "values" column.

If you use the following:

def metadata_from_arrow_type(
    pa_type: pa.Array,
    name: str = "",
) -> plc.interop.ColumnMetadata | None:
    metadata = plc.interop.ColumnMetadata(name) #None
    if pa.types.is_list(pa_type):
        child_meta = [plc.interop.ColumnMetadata("offsets")]
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(name, child_meta)
    elif pa.types.is_struct(pa_type):
        child_meta = []
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(
            name,
            # libcudf does not store field names, so just match pyarrow's.
            child_meta
        )
    return metadata

Then there are no errors.

I debugged this by:

lithomas1 commented 3 weeks ago

Thanks a bunch for the debugging!

This is because the to_arrow interop expects, if the column_metadata for the children of a ListColumn is not empty, that you provide metadata for both the offsets column and the "values" column.

This makes sense, I was thinking in terms of pyarrow ListType which doesn't have this. (This could be better documented, though, probably on the interop page - which I now realize isn't getting built in the docs.)

I'll leave this issue open, though, since I don't think the code should segfault. (For structs, a nice runtime error is raised).

wence- commented 3 weeks ago

Yeah, we can do something like this:

diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
   auto children_meta =
     metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+               "Number of field names and number of children doesn't match\n");
+
   auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
   if (child_arrays.empty()) {
     return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

lithomas1 commented 3 weeks ago

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

I haven't looked at libcudf internals, but I'd think that offset information is exposed in e.g. table_metadata which is returned in the readers.

On the pylibcudf side, we could hide this complexity by exposing the metadata_from_arrow_type helper you fixed for me here, as I think libcudf types map 1-1 to Arrow types. (Or, we could directly accept a pyarrow schema instead and call metadata_from_arrow_type internally.)

P.S. Me and @vyasr talked about how interop would look like in the future, and I think for columns/scalars we were leaning towards using the buffer protocol/arrow c data interface/cuda array interface to generically support interop. So for those types at the very least, you could just do np.array(plc column), etc. without dealing with interop. (we were planning to bring it up to the group on Monday)

This doesn't work for tables, unfortunately, since table metadata is not stored with libcudf table.

vyasr commented 3 weeks ago

My general approach to pylibcudf to this point has been to keep it as a minimal, faithful export of libcudf algorithms to Python without adding much in the way of syntactic sugar to improve the API. I do think that work is worth doing, I just haven't prioritized it since my assumption is that internal usage of pylibcudf (inside cuDF classic and cudf.polars) is going to dwarf any external usage until we have a proper standalone package anyway. As a result there are a lot of APIs like this that have very sharp edges. I'm not sure how much effort we should invest into improving them just yet.

That said, in this particular case since we do a lot of Arrow interop in testing (especially interactively) it's probably worth making developers lives easier if possible. The question then is, do we want to make users create ColumnMetadata objects, or should we support something more ergonomic? I think #15130 captures the analogous questions for groupby-agg APIs and maybe is instructive as to how I'm thinking about a good API there. Do you think similar ideas would apply here?

wence- commented 3 weeks ago

For testing at the pylibcudf level, do we care about the metadata attached to arrow objects at all? libcudf doesn't care about metadata. So either pylibcudf also doesn't care. Or pylibcudf does care, but then it needs a principled way of attaching metadata to columns, I think.

vyasr commented 3 weeks ago

My claim is that pylibcudf doesn't care, but users may want to produce arrow objects with metadata and so interop is the only place where there should be a principled way to do this right? I don't think there needs to be metadata attached to columns, but there should be a way to produce arrow arrays from columns that allows the attachment of metadata upon array creation.

lithomas1 commented 3 weeks ago

A solution could maybe be to make users that want to have the metadata follow the pylibcudf Table around use the TableWithMetadata class I added for I/O. (pylibcudf operations would then either accept TableWithMetadata or Table).

This also solves the usability issue of users having to keep track of metadata by hand after an I/O operation. (since after I/O users get a TableWithMetadata, but they have to keep track of column metadata manually, which is easy to get wrong, especially when the dtypes change after operations)

Then, we could make to_arrow on a pylibcudf Table give an Arrow table with default names. (And for users that want their own names, they can construct a pylibcudf TableWithMetadata from the Table and their own column names).

lithomas1 commented 2 weeks ago

Yeah, we can do something like this:

diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
   auto children_meta =
     metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+               "Number of field names and number of children doesn't match\n");
+
   auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
   if (child_arrays.empty()) {
     return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

@wence- Were you going to submit a patch for this?