Open karthikeyann opened 1 week ago
I believe we could change libcudf dictionary to support signed indices only without breaking anyone honestly. Or we could consider dropping dictionary support (throwing an exception) in the arrow interop since we cannot guarantee the type even works appropriately with all of libcudf at this point.
@davidwendt discussed and decided to switch dictionary column to signed indices.
Describe the bug During converting data between arrow and cudf,
from_arrow
converts dictionary indices to unsigned types. libcudf dictionary indices become unsigned int even if arrow has signed int.https://github.com/rapidsai/cudf/blob/4cd40eedefdfe713df1a263a4fa0e723995520c5/cpp/src/interop/from_arrow_host.cu#L269-L280
But this causes issues with round tripping to exact types. https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout:~:text=Since%20unsigned%20integers,by%20an%20application. Also the arrow docs prefers to use signed types.
Starting a discussion here to update this preference.
Additional context Velox does not support unsigned indices in dictionary column. while round tripping cudf table, this issue was found.