rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support duplicate column labels in cudf.DataFrame #16533

Open mroeschke opened 3 months ago

mroeschke commented 3 months ago

Is your feature request related to a problem? Please describe.

pandas currently supports duplicate column labels when constructing a DataFrame, e.g. pandas.DataFrame([[0, 1]], columns=[0, 0]), as well as operations on these duplicate labels (indexing, transferring index levels to columns, etc.).

cudf currently does not support this because all public data structures are essentially represented as a mapping of label -> column, which requires column labels to be unique. Therefore, there are many places where we must check for duplicate labels and raise a ValueError or NotImplementedError (for eventual fallback in cudf.pandas).
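To illustrate why a label -> column mapping forces unique labels, here is a minimal pure-Python sketch. Plain lists stand in for GPU columns, and `find_duplicate_labels` is a hypothetical helper written for this example, not a cudf function.

```python
from collections import Counter
from collections.abc import Hashable, Sequence


def find_duplicate_labels(labels: Sequence[Hashable]) -> list[Hashable]:
    """Return each label that appears more than once (hypothetical helper)."""
    counts = Counter(labels)
    return [label for label, count in counts.items() if count > 1]


# Two columns that share the label 0, as in pandas.DataFrame([[0, 1]], columns=[0, 0]).
labels = [0, 0]
columns = [[0], [1]]  # plain lists standing in for real columns

# A dict keyed by label keeps only the last column stored under each key,
# so the first column is silently dropped.
as_mapping = dict(zip(labels, columns))

duplicates = find_duplicate_labels(labels)
if duplicates:
    # A label-keyed structure has to detect this case and refuse it.
    message = f"Duplicate column labels are not supported: {duplicates}"
```

Because the dict cannot hold both columns, any code path built on such a mapping has to check for duplicates up front and raise instead.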

(This also applies to a MultiIndex which has "partial" support for duplicate names xref https://github.com/rapidsai/cudf/issues/10500)

Describe the solution you'd like

It would be great to support duplicate column labels without having to fall back to cudf.pandas. The most "minimal" change would probably be to have the ColumnAccessor also carry a dict[Hashable, list[int]] of column label -> integer positions, and to have the mapping of columns be a dict[int, ColumnBase] of integer position -> Column.
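A rough sketch of the two mappings described above, in plain Python. The class name `PositionalColumnAccessor` and its methods are hypothetical and are not part of cudf; plain lists stand in for ColumnBase objects.

```python
from collections.abc import Hashable, Sequence
from typing import Any


class PositionalColumnAccessor:
    """Hypothetical sketch of the proposal; not cudf's ColumnAccessor."""

    def __init__(self, labels: Sequence[Hashable], columns: Sequence[Any]) -> None:
        if len(labels) != len(columns):
            raise ValueError("labels and columns must have the same length")
        # integer position -> column (stand-in for dict[int, ColumnBase])
        self._data: dict[int, Any] = dict(enumerate(columns))
        # column label -> every integer position that carries that label
        self._positions: dict[Hashable, list[int]] = {}
        for pos, label in enumerate(labels):
            self._positions.setdefault(label, []).append(pos)

    @property
    def labels(self) -> list[Hashable]:
        """Labels in positional order, duplicates included."""
        ordered: list[Hashable] = [None] * len(self._data)
        for label, positions in self._positions.items():
            for pos in positions:
                ordered[pos] = label
        return ordered

    def positions(self, label: Hashable) -> list[int]:
        """All integer positions stored under a label."""
        if label not in self._positions:
            raise KeyError(label)
        return list(self._positions[label])

    def select(self, label: Hashable) -> list[Any]:
        """All columns stored under a label, in positional order."""
        return [self._data[pos] for pos in self.positions(label)]


acc = PositionalColumnAccessor([0, 0, "b"], [[10], [20], [30]])
both_zero_columns = acc.select(0)  # both columns labelled 0 are retained
```

Keying the columns by position keeps each one addressable even when labels repeat, while the label -> positions dict preserves label-based lookup. How operations such as renaming or dropping columns would keep the two dicts in sync is left open here.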

Describe alternatives you've considered

Status quo and fall back to cudf.pandas for this case.

Additional context

xref https://github.com/rapidsai/cudf/pull/16514

vyasr commented 3 months ago

I would recommend that we not try to address this until we rework cudf internals around pylibcudf objects. At that time we'll be reconsidering underlying data structures anyway.