Closed anthony-chang closed 2 years ago
I've edited to also include the same problem for \b
>>> cudf.Series(['_']).str.replace(r'\b', '@', regex=True)
0 _
dtype: object
>>> pd.Series(['_']).str.replace(r'\b', '@', regex=True)
0 @_@
dtype: object
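For reference, the pandas result can be reproduced with Python's `re` module directly (pandas delegates regex replacement to `re`), which shows that `_` is a word character in Python and therefore has a word boundary on each side:

```python
import re

# In Python's re, '_' is a word character ([A-Za-z0-9_]), so the
# zero-width \b assertion matches before and after it in '_'.
result = re.sub(r'\b', '@', '_')
print(result)  # -> '@_@'
# cuDF (as reported here) treats '_' as a non-word character, so
# \b finds no match and the string comes back unchanged.
```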
There are also some inconsistencies with non-word boundary \B, specifically in string split, around some non-alphanumeric characters.
>>> cudf.Series([':', '(', ')', ';', ',', '.', '<', '>', '[', ']', '!', '@', '#', '$', '%', '^', '&', '*', '`', '~', '-', '_', '+', '=', '|', '\\', '\'', '"']).str.split(r'\B', regex=True)
0 [, :]
1 [, (]
2 [, )]
3 [, ;]
4 [, ,]
5 [, .]
6 [, <]
7 [, >]
8 [, []
9 [, ]]
10 [, !]
11 [, @]
12 [, #]
13 [, $]
14 [, %]
15 [, ^]
16 [, &]
17 [, *]
18 [, `]
19 [, ~]
20 [, -]
21 [, _]
22 [, +]
23 [, =]
24 [, |]
25 [, \]
26 [, ']
27 [, "]
dtype: list
>>> pd.Series([':', '(', ')', ';', ',', '.', '<', '>', '[', ']', '!', '@', '#', '$', '%', '^', '&', '*', '`', '~', '-', '_', '+', '=', '|', '\\', '\'', '"']).str.split(r'\B', regex=True)
0 [, :, ]
1 [, (, ]
2 [, ), ]
3 [, ;, ]
4 [, ,, ]
5 [, ., ]
6 [, <, ]
7 [, >, ]
8 [, [, ]
9 [, ], ]
10 [, !, ]
11 [, @, ]
12 [, #, ]
13 [, $, ]
14 [, %, ]
15 [, ^, ]
16 [, &, ]
17 [, *, ]
18 [, `, ]
19 [, ~, ]
20 [, -, ]
21 [_]
22 [, +, ]
23 [, =, ]
24 [, |, ]
25 [, \, ]
26 [, ', ]
27 [, ", ]
dtype: object
For word boundary \b, only _ seems to be problematic:
>>> cudf.Series(['_']).str.split(r'\b', regex=True)
0 [_]
dtype: list
>>> pd.Series(['_']).str.split(r'\b', regex=True)
0 [, _, ]
dtype: object
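Again, the pandas behaviour matches Python's `re` module: since Python 3.7, `re.split` can split on zero-width matches, and the empty edge tokens come from \b or \B matching at the ends of the string. A minimal check:

```python
import re

# ':' contains no word characters, so neither end is a word
# boundary and \B matches at both ends.
print(re.split(r'\B', ':'))  # -> ['', ':', '']

# '_' is a word character, so both ends ARE word boundaries:
# \B never matches, and \b matches at both ends instead.
print(re.split(r'\B', '_'))  # -> ['_']
print(re.split(r'\b', '_'))  # -> ['', '_', '']
```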
> There also are some inconsistencies with non-word boundary \B, specifically in string split, around some non-alphanumeric characters.

Ignoring the _ example, I assume the concern is the number of tokens produced by split? If so, this appears to be a separate issue specific to split and \b and \B:
>>> import pandas as pd
>>> import cudf
>>> cudf.Series(['ab', '-+']).str.split(r'\b', regex=True)
0 [, ab]
1 [-+]
dtype: list
>>> pd.Series(['ab', '-+']).str.split(r'\b', regex=True)
0 [, ab, ]
1 [-+]
dtype: object
>>> cudf.Series(['ab', '-+']).str.split(r'\B', regex=True)
0 [a, b]
1 [, -, +]
dtype: list
>>> pd.Series(['ab', '-+']).str.split(r'\B', regex=True)
0 [a, b]
1 [, -, +, ]
dtype: object
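The pandas token counts above agree with Python's `re.split` on the same inputs, which can serve as the reference for the expected number of tokens:

```python
import re

# \b matches at both ends of 'ab' (word chars on one side only),
# so the edge tokens are empty; '-+' has no word chars, so \b
# never matches and the string is returned whole.
print(re.split(r'\b', 'ab'))  # -> ['', 'ab', '']
print(re.split(r'\b', '-+'))  # -> ['-+']

# \B is the complement: it matches between 'a' and 'b' but not at
# the ends of 'ab', and at every position of '-+' including both ends.
print(re.split(r'\B', 'ab'))  # -> ['a', 'b']
print(re.split(r'\B', '-+'))  # -> ['', '-', '+', '']
```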
> There also are some inconsistencies with non-word boundary \B, specifically in string split, around some non-alphanumeric characters.
>
> Ignoring the _ example, I assume the concern is the number of tokens produced by split? If so, this appears to be a separate issue specific to split and \b and \B:
Right, my bad this isn't just limited to some characters. Should I open a separate issue for this?
Yes, I think so. The split fix could be involved and would go into a separate PR at least.
Describe the bug
cuDF matches positions around _ as non-word boundaries but Python/Java does not. This was found by the fuzz tests while working on NVIDIA/spark-rapids#5692.

Steps/Code to reproduce bug

Expected behavior
I would like to match the Python/Java behaviour.
Environment overview (please complete the following information)
Environment details
Additional context
None