Open JamesMaki opened 2 weeks ago
This has uncovered a couple of issues. First, there is a bug in libcudf when handling nested quantifiers, which is addressed in PR #16798.
Second, the rules for matching in findall do not match the Python definition for re.findall():
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
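As a quick illustration of the rules quoted above, here is a minimal sketch using Python's standard `re` module. The sample string and patterns are made up for illustration and are not taken from the report.

```python
import re

# Hypothetical sample text, chosen only to illustrate the result shapes.
text = "ab12 cd34"

# No capturing groups: each result is the whole match.
no_groups = re.findall(r"[a-z]{2}\d{2}", text)

# Exactly one capturing group: each result is that group's text.
one_group = re.findall(r"[a-z]{2}(\d{2})", text)

# Multiple capturing groups: each result is a tuple of the groups' text.
two_groups = re.findall(r"([a-z]{2})(\d{2})", text)

# A non-capturing group (?:...) does not change the result shape.
non_capturing = re.findall(r"(?:[a-z]{2})\d{2}", text)

print(no_groups)      # ['ab12', 'cd34']
print(one_group)      # ['12', '34']
print(two_groups)     # [('ab', '12'), ('cd', '34')]
print(non_capturing)  # ['ab12', 'cd34']
```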
The current behavior does not consider the existence of capture groups, so this will need to be addressed in a separate PR. One sticking point is the handling of multiple groups, for which the definition specifies returning tuples. Since tuples are not a libcudf type, the closest result would be either a flattened list column (consecutive row elements represent the tuple) or a nested list column.
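To make the two candidate layouts concrete, here is a rough sketch in plain Python. Ordinary lists stand in for libcudf list columns, and the helper names `findall_flattened` and `findall_nested` are hypothetical, not libcudf or cuDF APIs. It assumes a pattern with two or more capturing groups.

```python
import re

def findall_flattened(strings, pattern):
    """One list per row; the members of each match 'tuple' sit consecutively."""
    compiled = re.compile(pattern)
    rows = []
    for s in strings:
        flat = []
        for m in compiled.finditer(s):
            # default="" mirrors re.findall, which reports unmatched groups as ''
            flat.extend(m.groups(default=""))
        rows.append(flat)
    return rows

def findall_nested(strings, pattern):
    """One list per row; each match becomes its own inner list of groups."""
    compiled = re.compile(pattern)
    return [
        [list(m.groups(default="")) for m in compiled.finditer(s)]
        for s in strings
    ]

# Hypothetical two-row input and a two-group pattern.
rows = ["ab12 cd34", "ef56"]
pattern = r"([a-z]{2})(\d{2})"

flattened = findall_flattened(rows, pattern)
nested = findall_nested(rows, pattern)

print(flattened)  # [['ab', '12', 'cd', '34'], ['ef', '56']]
print(nested)     # [[['ab', '12'], ['cd', '34']], [['ef', '56']]]
```

The flattened form keeps the column type the same as today's result but requires the caller to know the group count in order to regroup elements; the nested form preserves the match boundaries at the cost of an extra level of nesting.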
Describe the bug
cuDF .str.findall returns incorrect results with a regex pattern that uses a quantifier with a capturing group.
Steps/Code to reproduce bug
Note: Without the quantifier, shortening the pattern to just r'(\d{4}\s)', cuDF returns the correct results of [1111 , 2222 , 3333 , 4444 ].
Expected behavior
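As a point of comparison, the sketch below shows what Python's standard `re.findall` returns when a single capturing group is quantified. The input string and the `{2}` quantifier are hypothetical and are not taken from the reporter's reproducer. Only the unquantified pattern `r'(\d{4}\s)'` comes from the note above.

```python
import re

# Hypothetical input; not the reporter's data.
text = "1111 2222 3333 4444 "

# Pattern from the note above: one capturing group, no quantifier on the group.
unquantified = re.findall(r"(\d{4}\s)", text)

# Hypothetical quantified variant: the group repeats twice per match, and
# Python reports only the last repetition captured by the group.
quantified = re.findall(r"(\d{4}\s){2}", text)

print(unquantified)  # ['1111 ', '2222 ', '3333 ', '4444 ']
print(quantified)    # ['2222 ', '4444 ']
```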
Environment overview (please complete the following information)
Tested in the latest NGC Docker image on RTX 5880 Ada and A100 SXM; also confirmed that this behavior exists in the latest nightly build.