rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Initial support for string UDFs via Numba #9639

Open brandon-b-miller opened 2 years ago

brandon-b-miller commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently we can't use string columns inside UDFs, for several reasons. First, Numba, which forms the basis of our UDF framework, has limited support for strings in general. Second, even if strings were supported in Numba, we would still need to extend it so that it can generate kernels that work as we expect on the buffers containing our string data. Lastly, there are special memory considerations on the GPU that complicate the situation further.
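To make the "buffers containing our string data" point concrete, here is a minimal CPU-only sketch of an Arrow-style string column: one contiguous UTF-8 byte buffer plus an offsets buffer. The helper names (pack_strings, row_bytes, row_len) are made up for illustration and are not cuDF or Numba APIs, and null handling is left out entirely.

```python
def pack_strings(values):
    """Pack Python strings into (chars, offsets), Arrow-style."""
    chars = bytearray()
    offsets = [0]
    for value in values:
        chars.extend(value.encode("utf-8"))
        offsets.append(len(chars))  # end of this row == start of the next
    return bytes(chars), offsets


def row_bytes(chars, offsets, i):
    """Return the raw UTF-8 bytes of row i."""
    return chars[offsets[i]:offsets[i + 1]]


def row_len(chars, offsets, i):
    """Character count of row i: count bytes that are not UTF-8 continuation bytes."""
    return sum(1 for b in row_bytes(chars, offsets, i) if (b & 0xC0) != 0x80)


chars, offsets = pack_strings(["abc", "héllo", ""])
lengths = [row_len(chars, offsets, i) for i in range(len(offsets) - 1)]
```

A per-row operation such as len therefore has to work on a slice of a shared buffer rather than on a standalone Python str object, which is why the kernel generation needs to know about this layout.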

Describe the solution you'd like

Recently @davidwendt has experimented with a C++ class that handles many of the nuances of working with single strings that live on the device inside UDFs. @gmarkall subsequently wrote a proof of concept showing how simple string functions such as len can be overloaded using Numba to map to the methods of that C++ class and baked into a kernel. We would like to plumb this machinery through cuDF. This roughly consists of the following steps:

  1. Make it so that when cuDF is built, the C++ string class and its methods are precompiled and made available as a blob of PTX or similar that we can link to when building a kernel in Python.
  2. Create the pipeline in Python that writes, links, compiles and executes the correct kernels that can leverage the aforementioned PTX blobs at runtime.
  3. Create Numba typing and lowering that overloads calls to common string functions in Python and maps them to the corresponding methods of the C++ class. Ideally we'd do all of them, although some may be more complex than others due to memory considerations. That's 43 functions:
    • capitalize
    • casefold
    • center
    • count
    • encode
    • endswith
    • expandtabs
    • find
    • format
    • format_map
    • index
    • isalnum
    • isalpha
    • isascii
    • isdecimal
    • isdigit
    • islower
    • isprintable
    • isspace
    • istitle
    • isupper
    • join
    • ljust
    • lower
    • lstrip
    • maketrans
    • removeprefix
    • removesuffix
    • replace
    • rfind
    • rindex
    • rjust
    • rpartition
    • rsplit
    • rstrip
    • split
    • splitlines
    • startswith
    • swapcase
    • title
    • translate
    • upper
    • zfill
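As a rough illustration of step 3, the toy below scans a UDF's source for calls and looks up which external functions would need to be linked in. This is only an analogy: Numba does this through type inference and lowering, not by walking the AST, and the symbol names in EXTERNAL_SYMBOLS are hypothetical placeholders rather than the names used in the actual PTX.

```python
import ast

# Hypothetical external symbol names, for illustration only.
EXTERNAL_SYMBOLS = {
    "len": "ext_string_len",
    "startswith": "ext_string_startswith",
    "find": "ext_string_find",
}


def required_symbols(udf_source):
    """Return the external symbols a UDF would need, found by scanning its calls."""
    needed = set()
    for node in ast.walk(ast.parse(udf_source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # plain call, e.g. len(s)
            name = func.id
        elif isinstance(func, ast.Attribute):  # method call, e.g. s.find("x")
            name = func.attr
        else:
            continue
        if name in EXTERNAL_SYMBOLS:           # calls without a mapping are skipped
            needed.add(EXTERNAL_SYMBOLS[name])
    return sorted(needed)


UDF_SOURCE = '''
def g(row):
    s = row["str_field"]
    if s.startswith("a"):
        return len(s)
    return s.upper().find("B")
'''

symbols = required_symbols(UDF_SOURCE)
```

Here upper has no entry in the table, so it is silently skipped; a real implementation would instead fail at compile time for any method that has no typing and lowering registered.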

Concretely, when we encounter a UDF written like this, for example:

def f(row):
    return len(row['str_field'])

Our code should

Describe alternatives you've considered

Additional context

If we can get this to work, it lays the groundwork for using other, more complex types inside UDFs in the future, following the same pattern of using Numba to map Python code to external function calls that we write to operate on a single data element.

Similar issue for applymap https://github.com/rapidsai/cudf/issues/3802

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

gmarkall commented 2 years ago

I believe @brandon-b-miller is actively working on this.

Whilst I'm commenting, I'll add a note that numba/numba#7621 helps support this implementation so may be a useful reference along with numba/numba-examples#40, which requires a similar mechanism of linking CUDA C/C++ with Numba kernels.

brandon-b-miller commented 2 years ago

This is being worked on, albeit slowly for now. We've had a lot of offline discussions about how we intend to proceed, and the general consensus is that some of these functions will be a lot easier to support than others, namely the ones that have predictable memory requirements. Hopefully more to come here soon.
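One way to read "predictable memory requirements" is by return type: methods that return an int or bool produce a fixed-size result per row, while methods that return a string or a list need output whose size depends on the data. The sketch below classifies a handful of str methods that way on the CPU. The grouping is my interpretation rather than an official criterion, and SAMPLE_CALLS covers only a few of the 43 methods.

```python
# Sample arguments for a few str methods (not the full list of 43).
SAMPLE_CALLS = {
    "count": ("a",),
    "find": ("a",),
    "startswith": ("a",),
    "isdigit": (),
    "upper": (),
    "replace": ("a", "bb"),
    "split": (),
}


def output_kind(method_name, args, sample="banana"):
    """Classify a str method by the type of value it returns on a sample input."""
    result = getattr(sample, method_name)(*args)
    if isinstance(result, (bool, int)):
        return "fixed-size"   # one number or flag per row
    if isinstance(result, str):
        return "string"       # output length depends on the data
    return "structured"       # e.g. a list of strings


kinds = {name: output_kind(name, args) for name, args in SAMPLE_CALLS.items()}
```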

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

gmarkall commented 2 years ago

This is still being worked on.

gmarkall commented 2 years ago

This is still active.

gmarkall commented 2 years ago

@brandon-b-miller Should this be moved to a different project board?

vyasr commented 2 years ago

@brandon-b-miller do you want to keep using this issue to track the remaining work as well (the methods that output strings)?

brandon-b-miller commented 2 years ago

Just wanted to provide an update on this feature since we now have partial support for this and I think we have a clear picture of what's left to be done and a tentative timeline. Here is a summary.

22.10 introduced string UDFs via the strings_udf library

With the merge of https://github.com/rapidsai/cudf/pull/11319 (as well as a flurry of follow-up fixes), a new separately installable package, strings_udf, was rolled out to support this. When it is present in the user's environment, users can pass string columns to UDFs through DataFrame.apply and Series.apply and use the following, hopefully familiar, Python methods within those UDFs:

CEC

CUDA 11.5 is currently required for this feature. CUDA enhanced compatibility is pending with PR https://github.com/rapidsai/cudf/pull/11884.

More features (methods that produce non-numeric data)

Functions and methods that return strings are being worked on for 22.12, with the main PR implementing the bulk of the plumbing at https://github.com/rapidsai/cudf/pull/11933. After this is merged, the following features will be added in phases:

The above functions currently require CUDA dynamic global memory allocation and can therefore have some unpredictable performance characteristics. We hope to make this problem go away in the future.

Won't add for now

Some features, like formatting, are not yet on the roadmap, nor are functions with structured return types, such as split, which returns a list.

Hopefully this gives us something to work with for now, and hopefully there will be more updates to this thread in the future!

vyasr commented 5 months ago

@brandon-b-miller could you update this issue with the current state of UDFs?