Open brandon-b-miller opened 2 years ago
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
I believe @brandon-b-miller is actively working on this.
Whilst I'm commenting, I'll add a note that numba/numba#7621 helps support this implementation so may be a useful reference along with numba/numba-examples#40, which requires a similar mechanism of linking CUDA C/C++ with Numba kernels.
This is being worked on, albeit slowly for now. We've had a lot of discussions of how we intend to proceed with this offline, but the general consensus is that some of these functions will be a lot easier to support than others, namely the ones that have predictable memory requirements. Hopefully more to come here soon.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This is still being worked on.
This is still active.
@brandon-b-miller Should this be moved to a different project board?
@brandon-b-miller do you want to keep using this issue to track the remaining work as well (the methods that output strings)?
Just wanted to provide an update on this feature since we now have partial support for this and I think we have a clear picture of what's left to be done and a tentative timeline. Here is a summary.
22.10 introduced string UDFs via the strings_udf library
With the merge of https://github.com/rapidsai/cudf/pull/11319 (as well as a flurry of follow-up fixes), a new, separately installable package, strings_udf, was rolled out to support this. When it is present in the user's environment, users can pass string columns to UDFs through DataFrame.apply and Series.apply and use the following, hopefully familiar, Python methods within those UDFs:
str.count()
str.startswith()
str.endswith()
str.find()
str.rfind()
str.isalnum()
str.isalpha()
str.isdecimal()
str.isdigit()
str.islower()
str.isupper()
str.isnumeric()
str.isspace()
str.istitle()
Comparison operators (==, !=, >, <, <=, >=)
Containment (str in other)
Length (len(str))
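As a rough illustration of how these methods might be combined, the sketch below defines a UDF that returns an integer for each row. The function name `score` and the sample data are made up for this example. The `apply_on_gpu` helper assumes that cudf and strings_udf are installed on a machine with a supported GPU; it is not called here, so the rest of the snippet runs as plain Python.

```python
def score(s):
    """Return a small integer code for one string (numeric output only)."""
    if s.startswith("cu") and s.endswith("df"):
        return 2
    if s.isdigit() or "gpu" in s:
        return 1
    # Fall back to a value derived from the length and a substring count.
    return len(s) + s.count("a")


def apply_on_gpu(values):
    """Assumption: needs cudf, the strings_udf package, and a GPU."""
    import cudf  # imported lazily so the rest of the sketch runs anywhere

    return cudf.Series(values).apply(score)


# Plain-Python check of the per-row logic.
sample = ["cudf", "12345", "banana", "my gpu"]
print([score(s) for s in sample])  # [2, 1, 9, 1]
```

On a suitable machine, `apply_on_gpu(sample)` would be the call that exercises the string UDF path; whether every construct used in `score` compiles in a given release is something to verify against that release.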
CEC
CUDA 11.5 is currently required for this feature. CUDA enhanced compatibility is pending with PR https://github.com/rapidsai/cudf/pull/11884.
More features (methods that produce non-numeric data)
Functions and methods that return strings are being worked on for 22.12, with the main PR implementing the bulk of the plumbing at https://github.com/rapidsai/cudf/pull/11933. After this is merged, the following features will be added in phases:
str.capitalize()
str.upper()
str.lower()
str.swapcase()
str.ljust()
str.rjust()
str.strip()
str.lstrip()
str.rstrip()
str.removeprefix()
str.removesuffix()
str.title()
str.center()
str.expandtabs()
str.replace()
str.zfill()
str.index()
str.rindex()
Slicing (str[1:3])
Concatenation (+ operator between strings)
Iteration (for char in str:)
The above functions currently require CUDA dynamic global memory allocation and can therefore have somewhat unpredictable performance characteristics. We hope to make this problem go away in the future.
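To make the memory point concrete, the plain-Python sketch below applies a hypothetical string-returning UDF (`tidy`, invented for this example) to a few rows and compares input and output lengths. The output length depends on each row's contents, so the size of the result is not known before the UDF runs, which is presumably why these methods need dynamic allocation on the device. Nothing here calls cuDF.

```python
def tidy(s):
    """Hypothetical UDF built from methods in the list above; returns a string."""
    return s.strip().upper().replace("A", "AAA")


rows = ["  a ", "banana", "xyz"]

# Pair each input length with the length of the string the UDF produces.
lengths = [(len(s), len(tidy(s))) for s in rows]
print(lengths)  # [(4, 3), (6, 12), (3, 3)]
```

One row shrinks, one doubles, and one stays the same size, so no single per-row output size fits all three.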
Won't add for now
Some features, such as formatting, are not yet on the roadmap, nor are functions with structured return types, such as split, which returns a list.
Hopefully this gives us something to work with for now and hopefully more updates to this thread in the future!
@brandon-b-miller could you update this issue with the current state of UDFs?
Is your feature request related to a problem? Please describe.
Currently we can't use string columns inside UDFs. This is for a number of reasons. Firstly, there is limited support for strings in general in Numba, which forms the basis of our UDF framework. Secondly, even if strings were supported in Numba, we would still need to extend Numba for it to be able to properly generate kernels that work as we expect on the buffers containing our string data. Lastly, there are special memory considerations on the GPU that complicate the situation further.
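To illustrate the buffer point, here is a plain-Python sketch of the Arrow-style layout that string columns use: one contiguous buffer of UTF-8 bytes plus an offsets array, so that row i occupies the byte range offsets[i] to offsets[i + 1]. The helper names (`to_buffers`, `row_bytes`, `row_len`) are invented for this example and null handling is omitted. A generated kernel would need to do the equivalent of `row_len` directly on device memory.

```python
def to_buffers(strings):
    """Pack Python strings into (chars, offsets), Arrow-style."""
    chars = bytearray()
    offsets = [0]
    for s in strings:
        chars += s.encode("utf-8")
        offsets.append(len(chars))
    return bytes(chars), offsets


def row_bytes(chars, offsets, i):
    """Return the raw UTF-8 bytes of row i without building a Python str."""
    return chars[offsets[i]:offsets[i + 1]]


def row_len(chars, offsets, i):
    """Character count of row i: count bytes that are not UTF-8 continuation bytes."""
    return sum((b & 0xC0) != 0x80 for b in row_bytes(chars, offsets, i))


chars, offsets = to_buffers(["cudf", "", "héllo"])
print(offsets)  # [0, 4, 4, 10]
print([row_len(chars, offsets, i) for i in range(3)])  # [4, 0, 5]
```

Note that the last row has 5 characters but 6 bytes, which is one reason a naive byte-oriented kernel would not behave like Python's `len`.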
Describe the solution you'd like
Recently @davidwendt has experimented with a C++ class which solves many of the nuances around handling single strings that live on the device inside UDFs. @gmarkall subsequently wrote a proof of concept showing how simple string functions such as len can be overloaded using Numba to map to the methods contained in that C++ class and baked into a kernel. We would like to plumb this machinery through cuDF. This roughly consists of the following steps:
capitalize
casefold
center
count
encode
endswith
expandtabs
find
format
format_map
index
isalnum
isalpha
isascii
isdecimal
isdigit
islower
isprintable
isspace
istitle
isupper
join
ljust
lower
lstrip
maketrans
removeprefix
removesuffix
replace
rfind
rindex
rjust
rpartition
rsplit
rstrip
split
splitlines
startswith
swapcase
title
translate
upper
zfill
Concretely, when we encounter a UDF that is written like this for example:
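A minimal UDF of this kind, given as an illustrative sketch rather than the original example, could be the function `f` below. It is accompanied by a plain-Python model of the intended semantics: `Masked` and `masked_len` are hypothetical stand-ins for cuDF's MaskedType and the planned len overload, not the real Numba extension code, and null propagation is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


def f(st):
    # The UDF as a user would write it: ordinary Python on one element.
    return len(st)


@dataclass
class Masked:
    """Hypothetical stand-in for MaskedType: a value plus a validity flag."""
    value: Optional[object]
    valid: bool


def masked_len(m: Masked) -> Masked:
    """Model of a len overload: Masked(string) in, Masked(int64) out."""
    if not m.valid:
        # Assumption: a null input row yields a null output row.
        return Masked(None, False)
    return Masked(f(m.value), True)


rows = [Masked("cudf", True), Masked(None, False), Masked("", True)]
print([masked_len(r) for r in rows])
```

In the real implementation the body of the overload would call into the device-side string class instead of Python's `len`.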
Our code should
len that we will write which expects a MaskedType(string) and returns a MaskedType(int64)
len method when provided a pointer to the start of the string
Describe alternatives you've considered
Additional context
If we can get this to work, it lays the groundwork for being able to use other, more complex types inside UDFs in the future, following the same pattern of using Numba to map Python code to external function calls that we write to operate on a single data element.
Similar issue for applymap: https://github.com/rapidsai/cudf/issues/3802