This is because the `to_arrow` interop expects that, if the `column_metadata` for the children of a `ListColumn` is not empty, you provide metadata for both the offsets column and the "values" column.
If you use the following:

```python
def metadata_from_arrow_type(
    pa_type: pa.Array,
    name: str = "",
) -> plc.interop.ColumnMetadata | None:
    metadata = plc.interop.ColumnMetadata(name)
    if pa.types.is_list(pa_type):
        # The offsets child needs a metadata entry too, not just the values.
        child_meta = [plc.interop.ColumnMetadata("offsets")]
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(
                pa_type.field(i).type, pa_type.field(i).name
            )
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(name, child_meta)
    elif pa.types.is_struct(pa_type):
        child_meta = []
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(
                pa_type.field(i).type, pa_type.field(i).name
            )
            child_meta.append(field_meta)
        # libcudf does not store field names, so just match pyarrow's.
        metadata = plc.interop.ColumnMetadata(name, child_meta)
    return metadata
```
Then there are no errors.
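The mismatch that triggers the failure can be sketched with plain-Python stand-ins (a minimal sketch: the `ColumnMetadata` class and child names below are illustrative, not the real pylibcudf types). libcudf's list columns have two children, offsets and values, so a non-empty `children_meta` must cover both:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:  # illustrative stand-in, not pylibcudf's class
    name: str = ""
    children_meta: list["ColumnMetadata"] = field(default_factory=list)

# libcudf stores a list column as two children: offsets + values.
LIST_CHILDREN = ("offsets", "values")

def check_list_metadata(meta: ColumnMetadata) -> None:
    # Mirrors the proposed CUDF_EXPECTS: non-empty children_meta must
    # have one entry per child, including the offsets column.
    if meta.children_meta and len(meta.children_meta) != len(LIST_CHILDREN):
        raise ValueError("Number of field names and number of children doesn't match")

# A single entry (just the values) trips the check; this is the shape
# of metadata that previously caused the segfault.
bad = ColumnMetadata("col", [ColumnMetadata("element")])
try:
    check_list_metadata(bad)
except ValueError as e:
    print("rejected:", e)

good = ColumnMetadata("col", [ColumnMetadata("offsets"), ColumnMetadata("element")])
check_list_metadata(good)
print("ok")
```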
I debugged this by running:

```shell
valgrind --vgdb=yes python bug.py
```
Thanks a bunch for the debugging!
> This is because the `to_arrow` interop expects, if the `column_metadata` for the children of a `ListColumn` is not empty, that you provide metadata for both the offsets column and the "values" column.
This makes sense, I was thinking in terms of pyarrow ListType which doesn't have this. (This could be better documented, though, probably on the interop page - which I now realize isn't getting built in the docs.)
I'll leave this issue open, though, since I don't think the code should segfault. (For structs, a nice runtime error is raised).
Yeah, we can do something like this:
```diff
diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
   auto children_meta =
     metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+               "Number of field names and number of children doesn't match\n");
+
   auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
   if (child_arrays.empty()) {
     return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);
```
However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.
I haven't looked at libcudf internals, but I'd think that offset information is exposed in e.g. `table_metadata`, which is returned by the readers.
On the pylibcudf side, we could hide this complexity by exposing the `metadata_from_arrow_type` helper you fixed for me here, as I think libcudf types map 1:1 to Arrow types. (Or, we could directly accept a pyarrow schema instead and call `metadata_from_arrow_type` internally.)
P.S.
@vyasr and I talked about what interop would look like in the future, and I think for columns/scalars we were leaning towards using the buffer protocol / Arrow C Data Interface / CUDA array interface to generically support interop. So for those types at the very least, you could just do `np.array(plc_column)`, etc., without dealing with interop.
(we were planning to bring it up to the group on Monday)
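As a stdlib-only illustration of the protocol idea (no pylibcudf here; `array.array` merely stands in for a column's buffer), a consumer that speaks the buffer protocol needs no bespoke interop call:

```python
import array

# Any object exposing the buffer protocol can be consumed generically;
# this is the same idea as np.array(plc_column) working via a protocol
# rather than a dedicated interop function.
data = array.array("i", [1, 2, 3, 4])
view = memoryview(data)  # zero-copy view obtained via the buffer protocol

print(view.format)  # 'i' (C int; byte size is platform-dependent)
print(list(view))   # [1, 2, 3, 4]
```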
This doesn't work for tables, unfortunately, since table metadata is not stored with a libcudf table.
My general approach to pylibcudf to this point has been to keep it as a minimal, faithful export of libcudf algorithms to Python without adding much in the way of syntactic sugar to improve the API. I do think that work is worth doing, I just haven't prioritized it since my assumption is that internal usage of pylibcudf (inside cuDF classic and cudf.polars) is going to dwarf any external usage until we have a proper standalone package anyway. As a result there are a lot of APIs like this that have very sharp edges. I'm not sure how much effort we should invest into improving them just yet.
That said, in this particular case, since we do a lot of Arrow interop in testing (especially interactively), it's probably worth making developers' lives easier if possible. The question then is, do we want to make users create `ColumnMetadata` objects, or should we support something more ergonomic? I think #15130 captures the analogous questions for groupby-agg APIs and maybe is instructive as to how I'm thinking about a good API there. Do you think similar ideas would apply here?
For testing at the pylibcudf level, do we care about the metadata attached to arrow objects at all? libcudf doesn't care about metadata. So either pylibcudf also doesn't care. Or pylibcudf does care, but then it needs a principled way of attaching metadata to columns, I think.
My claim is that pylibcudf doesn't care, but users may want to produce arrow objects with metadata and so interop is the only place where there should be a principled way to do this right? I don't think there needs to be metadata attached to columns, but there should be a way to produce arrow arrays from columns that allows the attachment of metadata upon array creation.
A solution could maybe be to make users that want metadata to follow a pylibcudf Table around use the `TableWithMetadata` class I added for I/O. (pylibcudf operations would then accept either a `TableWithMetadata` or a `Table`.)
This also solves the usability issue of users having to keep track of metadata by hand after an I/O operation. (since after I/O users get a TableWithMetadata, but they have to keep track of column metadata manually, which is easy to get wrong, especially when the dtypes change after operations)
Then, we could make `to_arrow` on a pylibcudf `Table` give an Arrow table with default names. (And users that want their own names can construct a pylibcudf `TableWithMetadata` from the `Table` and their own column names.)
@wence- Were you going to submit a patch for this?
**Describe the bug**

`plc.interop.to_arrow` is segfaulting when passed (possibly invalid) metadata.

**Steps/Code to reproduce bug**

This code will segfault or raise a `MemoryError`.

**Expected behavior**

No segfault. If the metadata being passed is wrong, an error should be raised (like in the struct case).

**Additional context**

The error seems to occur in the C++ Arrow conversion code (but it could still be a bug in either pylibcudf or libcudf), so I'm labelling this as both a libcudf and pylibcudf issue.