rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] .str.findall returning incorrect results when using a quantifier with a capturing group #16730

Open JamesMaki opened 2 weeks ago

JamesMaki commented 2 weeks ago

Describe the bug

cuDF .str.findall returns incorrect results for a regex pattern that applies a quantifier to a capturing group.

Steps/Code to reproduce bug

reg_ex = r'(\d{4}\s){4}'
test_string = 'TEST12 1111 2222 3333 4444 5555'
import cudf
s = cudf.Series([test_string])
s.str.findall(reg_ex)
# returns 
# 0    [2 , 1 , 2 , 3 , 4 ]
# dtype: list

Note: Without the quantifier, i.e. with the pattern shortened to r'(\d{4}\s)', cuDF returns the correct result of [1111 , 2222 , 3333 , 4444 ].

Expected behavior

reg_ex = r'(\d{4}\s){4}'
test_string = 'TEST12 1111 2222 3333 4444 5555'
import re
re.findall(reg_ex, test_string)
# returns
# ['4444 ']
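
For context on why Python returns only '4444 ' here: a quantified capturing group keeps only the text of its last iteration, and re.findall returns the group rather than the whole match when the pattern has exactly one group. A minimal sketch using only the standard library re module (the variable names are illustrative, not from the report):

```python
import re

reg_ex = r'(\d{4}\s){4}'
test_string = 'TEST12 1111 2222 3333 4444 5555'

# The pattern matches once: four repetitions of "four digits + whitespace".
m = re.search(reg_ex, test_string)
whole_match = m.group(0)  # the entire matched text
last_group = m.group(1)   # a repeated group keeps only its final iteration

# With exactly one capturing group, findall returns that group's text.
found = re.findall(reg_ex, test_string)

# Making the group non-capturing returns the whole match instead.
found_whole = re.findall(r'(?:\d{4}\s){4}', test_string)

print(repr(whole_match), repr(last_group), found, found_whole)
```

Only one match exists in the string because the trailing '5555' is not followed by whitespace.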

Environment overview

Tested in the latest NGC Docker image on RTX 5880 Ada and A100 SXM. I also confirmed that this behavior exists in the latest nightly build.

davidwendt commented 2 days ago

This has uncovered a couple of issues. First, there is a bug in libcudf when handling nested quantifiers, which is addressed in PR #16798. Second, the matching rules in findall do not follow the Python definition of re.findall():

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
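
The quoted rules can be illustrated with the standard library alone. This is a small sketch on a made-up input string, not taken from the report:

```python
import re

text = 'a1 b2 c3'  # illustrative input

# No groups: a list of whole-pattern matches.
no_groups = re.findall(r'[a-z]\d', text)

# Exactly one group: a list of that group's matches.
one_group = re.findall(r'[a-z](\d)', text)

# Multiple groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'([a-z])(\d)', text)

# Non-capturing groups do not change the form of the result.
non_capturing = re.findall(r'(?:[a-z])\d', text)

print(no_groups, one_group, two_groups, non_capturing)
```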

The current behavior does not consider the existence of capture groups, so this will need to be addressed in a separate PR. One sticking point is the handling of multiple groups, for which Python returns tuples. Since tuples are not a libcudf type, the closest result would be either a flattened list column (consecutive row elements represent the tuple) or a nested list column.
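
To make the two candidate layouts concrete, here is a rough sketch in plain Python. It assumes ordinary Python lists as stand-ins for libcudf list columns and uses a made-up pattern and input; it does not call any cuDF or libcudf API and is not a proposed implementation.

```python
import re

rows = ['a1 b2', 'c3', 'none']  # illustrative strings, one per row
pattern = r'([a-z])(\d)'        # two capturing groups
num_groups = re.compile(pattern).groups

# What Python produces per row: a list of tuples.
per_row = [re.findall(pattern, s) for s in rows]

# Option 1, nested list: each match becomes its own inner list.
nested = [[list(t) for t in matches] for matches in per_row]

# Option 2, flattened list: consecutive elements in a row form one tuple.
flattened = [[g for t in matches for g in t] for matches in per_row]

# The flattened layout needs the group count to recover the tuples.
recovered = [
    [tuple(row[i:i + num_groups]) for i in range(0, len(row), num_groups)]
    for row in flattened
]

print(nested)
print(flattened)
```

The flattened form keeps the column one level deep but relies on the caller knowing the number of groups, while the nested form encodes the tuple boundaries in the structure itself.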