rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.46k stars 908 forks source link

[BUG] Arrow Dictionary indices to not prefer Unsigned #17327

Open karthikeyann opened 1 week ago

karthikeyann commented 1 week ago

Describe the bug During converting data between arrow and cudf, from_arrow converts dictionary indices to unsigned types. libcudf dictionary indices become unsigned int even if arrow has signed int.

https://github.com/rapidsai/cudf/blob/4cd40eedefdfe713df1a263a4fa0e723995520c5/cpp/src/interop/from_arrow_host.cu#L269-L280

But this causes issues with round tripping to exact types. https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout:~:text=Since%20unsigned%20integers,by%20an%20application. Also the arrow docs prefers to use signed types.

Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices. Additionally, we recommend avoiding using 64-bit unsigned integer indices unless they are required by an application.

Starting a discussion here to update this preference.

Additional context Velox does not support unsigned indices in dictionary column. while round tripping cudf table, this issue was found.

davidwendt commented 1 week ago

I believe we could change libcudf dictionary to support signed indices only without breaking anyone honestly. Or we could consider dropping dictionary support (throwing an exception) in the arrow interop since we cannot guarantee the type even works appropriately with all of libcudf at this point.

karthikeyann commented 3 days ago

@davidwendt discussed and decided to switch dictionary column to signed indices.