rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add version of extract_re that takes an index #9855

Open · andygrove opened this issue 2 years ago

andygrove commented 2 years ago

Is your feature request related to a problem? Please describe.
From Spark, when we call extract_re we are often interested in extracting only a single group rather than all the groups in the pattern. We currently call extract_re, which returns a Table, and then take the column we are interested in and discard the others. It would be more efficient if we could pass the column index to cuDF so that only one column needs to be instantiated.
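Roughly what this looks like today from cuDF Python, where the same extract kernel is exposed as Series.str.extract (a hedged sketch; the Spark plugin actually goes through the Java/JNI bindings):

```python
import cudf

s = cudf.Series(["abc-123-XYZ", "def-456-QRS"])

# extract materializes one column per capture group...
df = s.str.extract(r"([a-z]*)-([0-9]*)-([A-Z]*)")

# ...even when only one is wanted; the other two are discarded.
wanted = df[1]
```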

Describe the solution you'd like
I would like a signature something like extract_re(pattern, index).

Describe alternatives you've considered
None

Additional context
None

davidwendt commented 2 years ago

Why include groups in the pattern that will not return anything? For example, if you have the following pattern with 3 groups:

"([a-z]*)-([0-9]*)-([A-Z]*)"

and you only care about the 2nd group, just change the pattern to remove the ( ) from the groups that are not needed:

"[a-z]*-([0-9]*)-[A-Z]*"
andygrove commented 2 years ago

I cannot argue with the logic here, but this is just how the Spark function works and we have no control over how people invoke it.

We could potentially rewrite the regexp pattern in the plugin to remove unreferenced groups, but I would be nervous about going that far.
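One possible shape for such a rewrite, as a hedged sketch: demote every capture group other than the requested one to a non-capturing (?:...) group before handing the pattern to cuDF. demote_unused_groups is a hypothetical helper, and this naive version does not handle escaped parentheses, parens inside character classes, or named groups; it also assumes the regex engine accepts (?: ) groups.

```python
def demote_unused_groups(pattern: str, keep: int) -> str:
    """Turn every capturing group except `keep` (1-based) into a
    non-capturing group. Naive sketch: ignores escaped parens,
    character classes, and named groups."""
    out = []
    group = 0
    i = 0
    while i < len(pattern):
        # A "(" not followed by "?" opens a capturing group.
        if pattern[i] == "(" and not pattern.startswith("(?", i):
            group += 1
            if group != keep:
                out.append("(?:")  # demote to non-capturing
                i += 1
                continue
        out.append(pattern[i])
        i += 1
    return "".join(out)

print(demote_unused_groups(r"([a-z]*)-([0-9]*)-([A-Z]*)", keep=2))
# (?:[a-z]*)-([0-9]*)-(?:[A-Z]*)
```

After the rewrite the surviving group is group 1, so the plugin would always read column 0 of the extract result.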

beckernick commented 2 years ago

Is the goal to leverage information downstream in the DAG that indicates some of the capture groups weren't actually necessary for this execution?

Could you share a bit about the use cases where this comes up? For example, I can imagine it coming up in exploratory data analysis. Curious to understand how common/significant this is.

vyasr commented 2 years ago

I am also curious to understand this problem. From the original description it doesn't seem like the request is stemming from some complex workflow where some groups are "discovered" to be unnecessary, but rather from a desire to optimize user workflows even when users invoke the function with suboptimal regexes.

Naively this seems like a use case where we should be aiming to educate users about best practices for performance rather than optimizing additional code paths, i.e. we should be documenting that including unnecessary groups in the regex will impact performance. That would be consistent with a lot of our messaging around cuDF Python, where there are often ways to do things with pandas that would translate to slow cuDF solutions, and we try to document and socialize knowledge about faster ways to do those things in cuDF without trying to optimize all the slower ways. Maybe I'm missing a key reason that this use case is different, though.

andygrove commented 2 years ago

To add some more context to this request: Spark provides a SQL function regexp_extract with the signature regexp_extract(str, pattern, idx). This function was requested in SPARK-8255 to expose the equivalent Java regexp functionality, and it doesn't look like much thought was given to the inefficiency of this approach. The function has existed since Spark 1.5.0, so it is likely that people are using it. We already support this function in the RAPIDS Accelerator for Spark, but it would be more efficient if we could request just a single group to be extracted.
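For reference, regexp_extract is exposed in PySpark as pyspark.sql.functions.regexp_extract; a minimal example (with made-up data):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc-123-XYZ",)], ["s"])

# The caller asks for group 2 only, yet the pattern declares three capture
# groups; on the GPU path this means extract_re currently materializes all
# three columns and two are thrown away.
df.select(regexp_extract("s", r"([a-z]*)-([0-9]*)-([A-Z]*)", 2)).show()
```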

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.