rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Accelerate conversion from `arrow::StringViewType` to `arrow::StringType` in libcudf interop #15298

Open GregoryKimball opened 7 months ago

GregoryKimball commented 7 months ago

Is your feature request related to a problem? Please describe.

The Arrow 15 specification includes a definition of `arrow::StringViewType`, an alternate representation of `arrow::StringType`. String views are also referred to as Umbra strings or prefix strings.

A string view consists of two columns:

  1. A column of 16-byte fixed-width elements. The first 4 bytes contain the string size.
    • If size <= 12, the string is stored inline in the remaining 12 bytes (short string optimization).
    • If size > 12, the string is stored separately in the second column. The remaining 12 bytes hold the first 4 chars of the string plus a reference to the full string: an 8-byte pointer in Umbra-style layouts, or a 4-byte buffer index and a 4-byte offset in the Arrow format.
  2. A column of characters storing the strings that are too long to inline
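The layout above can be sketched on the host in plain C++. This is only an illustration, not libcudf or Arrow code: `encode_view` and `decode_view` are hypothetical helpers, and the sketch assumes the Arrow columnar format's encoding of long strings (4-byte prefix, 4-byte buffer index, 4-byte offset) on a little-endian machine, rather than a raw pointer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Strings of this length or shorter live inside the 16-byte view itself.
constexpr int32_t kInlineSize = 12;

// Hypothetical helper: builds the 16-byte view for `s`. Long strings are
// appended to `data`, which is assumed to be data buffer number 0.
std::array<uint8_t, 16> encode_view(const std::string& s, std::vector<uint8_t>& data)
{
  std::array<uint8_t, 16> view{};  // zero-filled, so unused inline bytes are 0
  const int32_t size = static_cast<int32_t>(s.size());
  std::memcpy(view.data(), &size, 4);  // bytes 0-3: string size
  if (size <= kInlineSize) {
    std::memcpy(view.data() + 4, s.data(), s.size());  // bytes 4-15: the string
  } else {
    const int32_t buffer_index = 0;
    const int32_t offset       = static_cast<int32_t>(data.size());
    std::memcpy(view.data() + 4, s.data(), 4);         // bytes 4-7: prefix
    std::memcpy(view.data() + 8, &buffer_index, 4);    // bytes 8-11: buffer index
    std::memcpy(view.data() + 12, &offset, 4);         // bytes 12-15: offset
    data.insert(data.end(), s.begin(), s.end());       // full string, prefix included
  }
  return view;
}

// Hypothetical helper: reads back the string that one 16-byte view refers to.
std::string decode_view(const uint8_t* view,
                        const std::vector<std::vector<uint8_t>>& data_buffers)
{
  int32_t size = 0;
  std::memcpy(&size, view, 4);
  assert(size >= 0);
  if (size <= kInlineSize) {
    return std::string(reinterpret_cast<const char*>(view + 4),
                       static_cast<std::size_t>(size));
  }
  int32_t buffer_index = 0;
  int32_t offset       = 0;
  std::memcpy(&buffer_index, view + 8, 4);
  std::memcpy(&offset, view + 12, 4);
  const auto& buf = data_buffers[static_cast<std::size_t>(buffer_index)];
  return std::string(reinterpret_cast<const char*>(buf.data() + offset),
                     static_cast<std::size_t>(size));
}
```

Note that a 12-byte string still fits inline, so the inline test is `size <= 12`; only strings of 13 bytes or more touch the character buffer.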

String view type enables some performance optimizations:

Describe the solution you'd like

Let's add interop support for string views in `from_arrow`, with CUDA C++ code that accepts string views and converts them to libcudf strings columns. We may also want to add string view compatibility to `to_arrow`, so we can hand off libcudf strings columns to host libraries that expect string views. We should be able to write CUDA C++ code that efficiently transforms `arrow::StringViewType` buffers into `arrow::StringType` buffers.
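As a rough illustration of the transformation being requested, the serial host-side sketch below turns a vector of 16-byte views into `arrow::StringType`-style buffers (an `n + 1` offsets array plus one contiguous chars buffer). It is not the libcudf implementation: `StringView`, `StringColumn`, `view_data`, and `to_string_column` are hypothetical names, validity bitmaps are ignored, and long strings are assumed to use the Arrow encoding (4-byte prefix, 4-byte buffer index, 4-byte offset). On the GPU, the first loop would correspond to a scan over the sizes and the second to a per-row copy kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical host-side mirror of one 16-byte view.
struct StringView {
  int32_t size;      // string length in bytes
  char payload[12];  // inline string, or prefix + buffer index + offset
};
static_assert(sizeof(StringView) == 16, "views are 16 bytes wide");

// Hypothetical container for arrow::StringType-style buffers.
struct StringColumn {
  std::vector<int32_t> offsets;  // n + 1 entries, offsets[0] == 0
  std::vector<char> chars;       // all strings back to back
};

// Returns a pointer to the first byte of the string a view refers to.
const char* view_data(const StringView& v,
                      const std::vector<std::vector<char>>& data_buffers)
{
  if (v.size <= 12) { return v.payload; }  // short strings are stored inline
  int32_t buffer_index = 0;
  int32_t offset       = 0;
  std::memcpy(&buffer_index, v.payload + 4, 4);  // skip the 4-byte prefix
  std::memcpy(&offset, v.payload + 8, 4);
  return data_buffers[static_cast<std::size_t>(buffer_index)].data() + offset;
}

StringColumn to_string_column(const std::vector<StringView>& views,
                              const std::vector<std::vector<char>>& data_buffers)
{
  StringColumn out;
  out.offsets.assign(views.size() + 1, 0);

  // Step 1: running sum of sizes gives each row's output offset.
  int64_t total = 0;
  for (std::size_t i = 0; i < views.size(); ++i) {
    total += views[i].size;
    // 32-bit offsets cap the chars buffer at 2^31 - 1 bytes.
    assert(total <= std::numeric_limits<int32_t>::max());
    out.offsets[i + 1] = static_cast<int32_t>(total);
  }

  // Step 2: copy every string's bytes to its slot in the chars buffer.
  out.chars.resize(static_cast<std::size_t>(total));
  for (std::size_t i = 0; i < views.size(); ++i) {
    std::copy_n(view_data(views[i], data_buffers), views[i].size,
                out.chars.begin() + out.offsets[i]);
  }
  return out;
}
```

Because every row's destination offset is known after the first step, the copies in the second step are independent of one another, which is what makes the conversion a good fit for a device kernel.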

Describe alternatives you've considered

Force libcudf users to convert their string views into strings on the host before passing the data to the device.

Additional context

Velox supports a string view type (ref1, ref2), Polars has switched to a string view representation, and DuckDB supports string view.

We may choose to investigate using string views in libcudf at some point, but for the foreseeable future string view refactoring will be lower priority than supporting large strings and improving performance with long strings.

JayjeetAtGithub commented 2 months ago

There is an interop example for converting `arrow::StringViewArray` to `cudf::column` in #16498. We can integrate this example into the interop module once nanoarrow supports string view types (discussion).

GregoryKimball commented 2 weeks ago

This may now be unblocked by https://github.com/apache/arrow-nanoarrow/pull/596.