rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] DataFrame iloc indexing is incorrect for repeated index entries in the "columns" part of the key #13266

Closed wence- closed 2 weeks ago

wence- commented 1 year ago

Describe the bug

import cudf
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(4).reshape(2, 2))
cdf = cudf.from_pandas(df)

df.iloc[:, [0, 1, 0]]
#    0  1  0
# 0  0  1  0
# 1  2  3  2

cdf.iloc[:, [0, 1, 0]]
#    0  1
# 0  0  1
# 1  2  3

This happens because ColumnAccessor.select_by_index deduplicates its index arguments, so the repeated 0 in the key is dropped.
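A minimal sketch of why deduplication drops the repeated column. This is a toy model, not cudf's code: it assumes a container keyed by unique column labels, and the names `select_unique` and `select_with_repeats` are hypothetical.

```python
# Toy model only: a dict keyed by column label stands in for a column
# container that requires unique labels. This is NOT cudf's implementation.
columns = {0: [0, 2], 1: [1, 3]}
labels = list(columns)


def select_unique(positions):
    """Select by position into a label-keyed dict; repeated labels collapse."""
    return {labels[i]: columns[labels[i]] for i in positions}


def select_with_repeats(positions):
    """Select by position into (label, data) pairs; repeats survive."""
    return [(labels[i], columns[labels[i]]) for i in positions]


deduplicated = select_unique([0, 1, 0])
repeated = select_with_repeats([0, 1, 0])

print(list(deduplicated))                 # labels kept by the dict-based selection
print([label for label, _ in repeated])   # labels kept by the list-based selection
```

The dict-based selection keeps two columns, which mirrors the cudf output above; the list-based one keeps three, which mirrors pandas.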

Expected behavior

This should match pandas.

wence- commented 1 year ago

This is probably a consequence of cudf not supporting duplicate column names:

pd.DataFrame(np.arange(4).reshape(2,2), columns=["a", "a"])
#    a  a
# 0  0  1
# 1  2  3
cudf.DataFrame(np.arange(4).reshape(2,2), columns=["a", "a"])
#    a
# 0  1
# 1  3

Yes, see #13273, the end result will depend on what we do for that case.

mroeschke commented 2 weeks ago

Looks like you had "fixed" this in commit https://github.com/rapidsai/cudf/commit/e0ffbd72e92 by raising a ValueError (I'll open a separate issue for cudf to support duplicate labels), so closing this out.
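For illustration, rejecting a key with repeated positions could look roughly like the sketch below. The helper name `reject_repeated_positions` and the error message are hypothetical; the actual check and wording in the linked commit are not reproduced here.

```python
from collections import Counter


def reject_repeated_positions(positions):
    """Return the positions unchanged, or raise if any position repeats.

    Hypothetical helper for illustration; not taken from cudf.
    """
    repeated = sorted(p for p, n in Counter(positions).items() if n > 1)
    if repeated:
        # Assumed message; cudf's real error text may differ.
        raise ValueError(
            f"repeated column positions are not supported: {repeated}"
        )
    return list(positions)


print(reject_repeated_positions([0, 1]))  # passes through unchanged

try:
    reject_repeated_positions([0, 1, 0])
except ValueError as err:
    print(f"ValueError: {err}")
```

Raising instead of silently dropping the repeated column makes the difference from pandas explicit until duplicate labels are supported.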