[BUG] For an arrow table that contains string columns and is converted from pandas, the from_arrow fails after slice because the length does not match.

infzo commented 1 year ago

Describe the bug: When a pyarrow table that contains string columns is created from a pandas DataFrame and then sliced, cudf.DataFrame.from_arrow fails with a length-mismatch ValueError.

Steps/Code to reproduce bug

import cudf
import pyarrow
import pandas as pd

cdf = pd.DataFrame.from_dict({'a': ['aa', 'bb', 'cc'], 'b': [1, 2, 3]})
print(cdf)

tbl = pyarrow.Table.from_pandas(cdf)
print(tbl)

tbl_slice = tbl.slice(0, 2)
print(tbl_slice)

gdf = cudf.DataFrame.from_arrow(tbl_slice)

>>> import cudf
>>> import pyarrow
>>> import pandas as pd
>>>
>>>
>>> cdf = pd.DataFrame.from_dict({'a': ['aa', 'bb', 'cc'], 'b': [1, 2, 3]})
>>> print(cdf)
    a  b
0  aa  1
1  bb  2
2  cc  3
>>>
>>>
>>> tbl = pyarrow.Table.from_pandas(cdf)
>>> print(tbl)
pyarrow.Table
a: string
b: int64
----
a: [["aa","bb","cc"]]
b: [[1,2,3]]
>>>
>>>
>>> tbl_slice = tbl.slice(0, 2)
>>> print(tbl_slice)
pyarrow.Table
a: string
b: int64
----
a: [["aa","bb"]]
b: [[1,2]]
>>>
>>>
>>> gdf = cudf.DataFrame.from_arrow(tbl_slice)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/dataframe.py", line 4458, in from_arrow
    out = out.set_index(
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/dataframe.py", line 2453, in set_index
    df.index = idx
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/dataframe.py", line 1027, in __setattr__
    super().__setattr__(key, col)
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 341, in index
    raise ValueError(
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
>>>

Expected behavior: from_arrow converts the sliced table into a cuDF DataFrame with two rows.

Environment overview (please complete the following information)

Environment location: Cloud(HuaweiCloud)
Method of cuDF install: conda

Environment details: Not found.

galipremsagar commented 1 year ago

@infzo Thanks for reporting this issue. It is related to the index metadata. After investigating, we found that the problem is in pyarrow and raised it upstream: https://github.com/apache/arrow/issues/15178

Until that is resolved, we recommend using the following workaround for a sliced table:

In [17]: cudf.DataFrame.from_pandas(tbl_slice.to_pandas())
Out[17]: 
    a  b
0  aa  1
1  bb  2

galipremsagar commented 1 year ago

Closing this issue, as an upstream issue has been raised: https://github.com/apache/arrow/issues/15178

galipremsagar commented 1 year ago

Reopening, since Arrow may not fix this on their end: https://github.com/rapidsai/cudf/pull/12665

rapidsai/cudf issue #12463