Is your feature request related to a problem? Please describe.
pandas currently supports duplicate column labels when constructing a DataFrame, e.g. pandas.DataFrame([[0, 1]], columns=[0, 0]), as well as operations on these duplicate labels (indexing, index-to-column level transferring, etc.).
cudf currently does not support this because all public data structures are essentially represented as a mapping of label -> column, which requires column labels to be unique. Therefore, there are many places where we must check for duplicate labels and raise a ValueError/NotImplementedError (for eventual fallback in cudf.pandas).
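To illustrate why the uniqueness requirement forces these checks, here is a minimal sketch in plain Python. The helper name `check_unique_labels` is hypothetical and is not part of cudf; plain lists stand in for columns.

```python
from collections import Counter
from collections.abc import Hashable, Iterable


def check_unique_labels(labels: Iterable[Hashable]) -> None:
    """Hypothetical guard: raise if any column label appears more than once."""
    counts = Counter(labels)
    duplicates = [label for label, n in counts.items() if n > 1]
    if duplicates:
        # Assumption: raising NotImplementedError marks the case as unsupported
        # so that a fallback layer can handle it instead.
        raise NotImplementedError(
            f"Duplicate column labels are not supported: {duplicates}"
        )


# A label -> column mapping keeps only the last column for a repeated label,
# so the guard has to run before such a mapping is built.
labels = [0, 0]
columns = [[0], [1]]
as_mapping = dict(zip(labels, columns))  # the first column is silently dropped
```

Without the guard, the first of the two columns would be lost without any error, which is presumably why the checks need to appear in so many places.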
Describe the solution you'd like
It would be great to support duplicate column labels without having to fall back to cudf.pandas. The most "minimal" change would probably be to have the ColumnAccessor also carry a dict[Hashable, list[int]] of column label -> integer positions, and have the mapping of columns be a dict[int, ColumnBase] of integer position -> Column.
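A rough sketch of the two mappings described above, in plain Python. The class `DuplicateLabelAccessor` and its methods are hypothetical names chosen for illustration; this is not cudf's actual ColumnAccessor, and plain lists stand in for ColumnBase.

```python
from collections.abc import Hashable, Sequence
from typing import Any


class DuplicateLabelAccessor:
    """Hypothetical stand-in for the proposed layout, not cudf's ColumnAccessor."""

    def __init__(self, labels: Sequence[Hashable], columns: Sequence[Any]) -> None:
        if len(labels) != len(columns):
            raise ValueError("labels and columns must have the same length")
        # integer position -> column (plain lists stand in for ColumnBase)
        self._data: dict[int, Any] = dict(enumerate(columns))
        # column label -> every integer position holding that label
        self._positions: dict[Hashable, list[int]] = {}
        for pos, label in enumerate(labels):
            self._positions.setdefault(label, []).append(pos)

    def select(self, label: Hashable) -> list[Any]:
        """Return all columns stored under ``label``, in positional order."""
        return [self._data[pos] for pos in self._positions[label]]

    @property
    def labels(self) -> list[Hashable]:
        """Rebuild the label sequence, duplicates included, in positional order."""
        by_pos = {pos: label for label, ps in self._positions.items() for pos in ps}
        return [by_pos[pos] for pos in range(len(self._data))]


acc = DuplicateLabelAccessor(labels=[0, 0, "a"], columns=[[0], [1], [2]])
both = acc.select(0)  # both columns labelled 0, in positional order
```

Because columns are keyed by position, repeated labels no longer collide; the label mapping only records where each label lives.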
Describe alternatives you've considered
Keep the status quo and fall back to cudf.pandas for this case.
I would recommend that we not try to address this until we rework cudf internals around pylibcudf objects. At that time we'll be reconsidering underlying data structures anyway.
This limitation also applies to a MultiIndex, which has "partial" support for duplicate names (xref https://github.com/rapidsai/cudf/issues/10500).
Additional context
xref https://github.com/rapidsai/cudf/pull/16514