rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] DataFrame `loc` indexing is incorrect with repeated column labels. #13269

Closed wence- closed 1 month ago

wence- commented 1 year ago

Describe the bug

This is basically #13266, but for `loc`. I will fix it separately because the two go through different code paths.

import cudf
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(4).reshape(2, 2))
cdf = cudf.from_pandas(df)

df.loc[:, [0, 1, 0]]
#    0  1  0
# 0  0  1  0
# 1  2  3  2

cdf.loc[:, [0, 1, 0]]
#    0  1
# 0  0  1
# 1  2  3

This is because `ColumnAccessor.select_by_label` uniquifies its input label arguments, so the repeated label `0` is dropped from the result.
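To illustrate the effect in isolation, here is a minimal pure-Python sketch. It is not cuDF's actual `ColumnAccessor` code: the `columns` dict and both helper functions are hypothetical stand-ins, assuming only that a selection keyed by label collapses repeats while a positional selection keeps them.

```python
# Hypothetical stand-in for a column store; NOT cuDF's real ColumnAccessor.
columns = {0: [0, 2], 1: [1, 3]}


def select_deduplicating(data, labels):
    """Build the result as a dict keyed by label, so repeated labels collapse."""
    return {label: data[label] for label in labels}


def select_preserving(data, labels):
    """Build the result as (label, values) pairs, keeping repeats and order."""
    return [(label, data[label]) for label in labels]


labels = [0, 1, 0]
dedup = select_deduplicating(columns, labels)
kept = select_preserving(columns, labels)

print(list(dedup))                   # [0, 1]
print([label for label, _ in kept])  # [0, 1, 0]
```

The first helper mirrors the behaviour reported above (two columns come back), and the second mirrors what pandas returns (three columns, with `0` repeated).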

Expected behavior

This should match pandas: `cdf.loc[:, [0, 1, 0]]` should return three columns, with column `0` repeated.

wence- commented 1 year ago

This is a consequence of #13273, and how it will be fixed depends on what we do there.

mroeschke commented 1 month ago

This was fixed by https://github.com/rapidsai/cudf/pull/16514, so closing.