rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] regexp: word boundaries `\b` and `\B` inconsistent with Java/Python around `_` #11062

Closed anthony-chang closed 2 years ago

anthony-chang commented 2 years ago

Describe the bug cuDF treats the positions around `_` as non-word boundaries, but Python and Java do not: both count `_` as a word character, so the edges of a lone `_` are word boundaries. This was found by the fuzz tests while working on NVIDIA/spark-rapids#5692.

Steps/Code to reproduce bug

>>> import cudf 
>>> cudf.Series(['_']).str.replace(r'\B', '@', regex=True)
0    @_@
dtype: object

>>> cudf.Series(['_', '_,', '_,a', '_,_']).str.contains(r'\B', regex=True)
0    True
1    True
2    True
3    True
dtype: bool
>>> import pandas as pd
>>> pd.Series(['_']).str.replace(r'\B', '@', regex=True)
0    _
dtype: object

>>> pd.Series(['_', '_,', '_,a', '_,_']).str.contains(r'\B', regex=True)
0    False
1     True
2    False
3    False
dtype: bool

Expected behavior I would like to match the Python/Java behaviour.
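
For reference, the expected results can also be checked without pandas or a GPU. The following is a minimal sketch using only Python's built-in `re` module; the variable names are illustrative and not part of the original report.

```python
import re

samples = ['_', '_,', '_,a', '_,_']

# `_` counts as a word character in Python's `re`, so both edges of a lone
# `_` are word boundaries and `\B` has nowhere to match.
replaced = re.sub(r'\B', '@', '_')

# `\B` matches only where both neighbours are word characters or both are not
# (the start and end of the string count as non-word).
contains_non_boundary = [re.search(r'\B', s) is not None for s in samples]

print(replaced)               # _
print(contains_non_boundary)  # [False, True, False, False]
```

These values agree with the pandas output above; only `'_,'` has a `\B` position, between the comma and the end of the string.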

Environment details

     ***git***
     commit c01a2a41b7d57a1360324270101d9304fbc9515f (HEAD -> branch-22.08, rapidsai/branch-22.08)
     Author: Karthikeyan <6488848+karthikeyann@users.noreply.github.com>
     Date:   Thu Jun 2 06:39:56 2022 +0530

     Fix Doxygen warnings in table header files (#10964)

     Fixes parts of https://github.com/rapidsai/cudf/issues/9373
     added missing documentation to fix doxygen warnings in table headers
     ignores doc generation for `strong_index_comparator_adapter`

     fixes 166  warnings.

     Authors:
     - Karthikeyan (https://github.com/karthikeyann)

     Approvers:
     - David Wendt (https://github.com/davidwendt)
     - Vyas Ramasubramani (https://github.com/vyasr)

     URL: https://github.com/rapidsai/cudf/pull/10964
     ***git submodules***

     ***OS Information***
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=18.04
     DISTRIB_CODENAME=bionic
     DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
     NAME="Ubuntu"
     VERSION="18.04.5 LTS (Bionic Beaver)"
     ID=ubuntu
     ID_LIKE=debian
     PRETTY_NAME="Ubuntu 18.04.5 LTS"
     VERSION_ID="18.04"
     HOME_URL="https://www.ubuntu.com/"
     SUPPORT_URL="https://help.ubuntu.com/"
     BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
     PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
     VERSION_CODENAME=bionic
     UBUNTU_CODENAME=bionic
     Linux c240m5-01 5.4.0-109-generic #123~18.04.1-Ubuntu SMP Fri Apr 8 09:48:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

     ***GPU Information***
     Mon Jun  6 17:19:14 2022
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |                               |                      |               MIG M. |
     |===============================+======================+======================|
     |   0  Tesla T4            On   | 00000000:19:00.0 Off |                    0 |
     | N/A   53C    P0    39W /  70W |  11809MiB / 15109MiB |    100%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   1  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
     | N/A   46C    P0    27W /  70W |   2307MiB / 15109MiB |      8%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   2  Tesla T4            On   | 00000000:86:00.0 Off |                    0 |
     | N/A   44C    P0    27W /  70W |   5273MiB / 15109MiB |      8%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   3  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
     | N/A   50C    P0    40W /  70W |   3330MiB / 15109MiB |     61%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+

     +-----------------------------------------------------------------------------+
     | Processes:                                                                  |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |    0   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
     |    0   N/A  N/A     30057      C   ...penjdk-amd64/jre/bin/java      403MiB |
     |    0   N/A  N/A     42914      C   ...penjdk-amd64/jre/bin/java     9689MiB |
     |    0   N/A  N/A     43651      C   ...penjdk-amd64/jre/bin/java      467MiB |
     |    0   N/A  N/A     52020      C   ...penjdk-amd64/jre/bin/java      403MiB |
     |    0   N/A  N/A     61698      C   ...penjdk-amd64/jre/bin/java      403MiB |
     |    0   N/A  N/A     62795      C   python                            435MiB |
     |    1   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
     |    1   N/A  N/A     23901      C   python                            935MiB |
     |    1   N/A  N/A     29081      C   /usr/bin/python                  1365MiB |
     |    2   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
     |    2   N/A  N/A     40055      C   python                            435MiB |
     |    2   N/A  N/A     60481      C   /opt/conda/bin/python            4831MiB |
     |    3   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
     |    3   N/A  N/A     23901      C   python                            935MiB |
     |    3   N/A  N/A     29081      C   /usr/bin/python                   951MiB |
     |    3   N/A  N/A     40055      C   python                            435MiB |
     |    3   N/A  N/A     60481      C   /opt/conda/bin/python             565MiB |
     |    3   N/A  N/A     62795      C   python                            435MiB |
     +-----------------------------------------------------------------------------+

     ***CPU***
     Architecture:        x86_64
     CPU op-mode(s):      32-bit, 64-bit
     Byte Order:          Little Endian
     CPU(s):              72
     On-line CPU(s) list: 0-71
     Thread(s) per core:  2
     Core(s) per socket:  18
     Socket(s):           2
     NUMA node(s):        2
     Vendor ID:           GenuineIntel
     CPU family:          6
     Model:               85
     Model name:          Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
     Stepping:            4
     CPU MHz:             3191.110
     CPU max MHz:         3700.0000
     CPU min MHz:         1200.0000
     BogoMIPS:            6000.00
     Virtualization:      VT-x
     L1d cache:           32K
     L1i cache:           32K
     L2 cache:            1024K
     L3 cache:            25344K
     NUMA node0 CPU(s):   0-17,36-53
     NUMA node1 CPU(s):   18-35,54-71
     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d

     ***CMake***

     ***g++***
     /usr/bin/g++
     g++ (Ubuntu 9.3.0-11ubuntu0~18.04.1) 9.3.0
     Copyright (C) 2019 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

     ***nvcc***

     ***Python***
     /home/antchang/miniconda3/envs/cudf/bin/python
     Python 3.9.13

     ***Environment Variables***
     PATH                            : /home/antchang/miniconda3/envs/cudf/bin:/home/antchang/.poetry/bin:/home/antchang/miniconda3/condabin:/home/antchang/.pyenv/shims:/home/antchang/.pyenv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/antchang/spark/bin:/home/antchang/spark/sbin
     LD_LIBRARY_PATH                 :
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    : /home/antchang/miniconda3/envs/cudf
     PYTHON_PATH                     :

     ***conda packages***
     /home/antchang/miniconda3/condabin/conda
     # packages in environment at /home/antchang/miniconda3/envs/cudf:
     #
     # Name                    Version                   Build  Channel
     _libgcc_mutex             0.1                 conda_forge    conda-forge
     _openmp_mutex             4.5                       2_gnu    conda-forge
     abseil-cpp                20210324.2           h9c3ff4c_0    conda-forge
     arrow-cpp                 7.0.0           py39he577829_7_cuda    conda-forge
     arrow-cpp-proc            3.0.0                      cuda    conda-forge
     aws-c-cal                 0.5.11               h95a6274_0    conda-forge
     aws-c-common              0.6.2                h7f98852_0    conda-forge
     aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
     aws-c-io                  0.10.5               hfb6a706_0    conda-forge
     aws-checksums             0.1.11               ha31a3da_7    conda-forge
     aws-sdk-cpp               1.8.186              hb4091e7_3    conda-forge
     bzip2                     1.0.8                h7f98852_4    conda-forge
     c-ares                    1.18.1               h7f98852_0    conda-forge
     ca-certificates           2022.5.18.1          ha878542_0    conda-forge
     cachetools                5.0.0              pyhd8ed1ab_0    conda-forge
     cuda-python               11.7.0           py39h3fd9d12_0    nvidia
     cudatoolkit               11.6.0               habf752d_9    nvidia
     cudf                      22.06.00a220531 cuda_11_py39_gd0b4e3032c_317    rapidsai-nightly
     cupy                      10.5.0           py39hc3c280e_0    conda-forge
     dlpack                    0.5                  h9c3ff4c_0    conda-forge
     fastavro                  1.4.12           py39hb9d737c_0    conda-forge
     fastrlock                 0.8              py39h5a03fae_2    conda-forge
     fsspec                    2022.5.0           pyhd8ed1ab_0    conda-forge
     gflags                    2.2.2             he1b5a44_1004    conda-forge
     glog                      0.6.0                h6f12383_0    conda-forge
     grpc-cpp                  1.45.2               hd8f4eba_3    conda-forge
     keyutils                  1.6.1                h166bdaf_0    conda-forge
     krb5                      1.19.3               h3790be6_0    conda-forge
     ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
     libblas                   3.9.0           14_linux64_openblas    conda-forge
     libbrotlicommon           1.0.9                h166bdaf_7    conda-forge
     libbrotlidec              1.0.9                h166bdaf_7    conda-forge
     libbrotlienc              1.0.9                h166bdaf_7    conda-forge
     libcblas                  3.9.0           14_linux64_openblas    conda-forge
     libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
     libcudf                   22.06.00a220531 cuda11_gd0b4e3032c_317    rapidsai-nightly
     libcurl                   7.83.1               h7bff187_0    conda-forge
     libedit                   3.1.20191231         he28a2e2_2    conda-forge
     libev                     4.33                 h516909a_1    conda-forge
     libevent                  2.1.10               h9b69904_4    conda-forge
     libffi                    3.4.2                h7f98852_5    conda-forge
     libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
     libgfortran-ng            12.1.0              h69a702a_16    conda-forge
     libgfortran5              12.1.0              hdcd56e2_16    conda-forge
     libgomp                   12.1.0              h8d9b700_16    conda-forge
     libgoogle-cloud           1.40.2               habd0e3a_0    conda-forge
     liblapack                 3.9.0           14_linux64_openblas    conda-forge
     libllvm11                 11.1.0               hf817b99_3    conda-forge
     libnghttp2                1.47.0               h727a467_0    conda-forge
     libnsl                    2.0.0                h7f98852_0    conda-forge
     libopenblas               0.3.20          pthreads_h78a6416_0    conda-forge
     libprotobuf               3.20.1               h6239696_0    conda-forge
     librmm                    22.06.00a220531 cuda11_g914cb4c8_75    rapidsai-nightly
     libssh2                   1.10.0               ha56f1ee_2    conda-forge
     libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
     libthrift                 0.16.0               h519c5ea_1    conda-forge
     libutf8proc               2.7.0                h7f98852_0    conda-forge
     libuuid                   2.32.1            h7f98852_1000    conda-forge
     libzlib                   1.2.12               h166bdaf_0    conda-forge
     llvmlite                  0.38.1           py39h7d9a04d_0    conda-forge
     lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
     ncurses                   6.3                  h27087fc_1    conda-forge
     numba                     0.55.1           py39h66db6d7_1    conda-forge
     numpy                     1.21.6           py39h18676bf_0    conda-forge
     nvtx                      0.2.3            py39h3811e60_1    conda-forge
     openssl                   1.1.1o               h166bdaf_0    conda-forge
     orc                       1.7.3                h6c59b99_1    conda-forge
     packaging                 21.3               pyhd8ed1ab_0    conda-forge
     pandas                    1.4.2            py39h1832856_2    conda-forge
     parquet-cpp               1.5.1                         2    conda-forge
     pip                       22.1.2             pyhd8ed1ab_0    conda-forge
     protobuf                  3.20.1           py39h5a03fae_0    conda-forge
     ptxcompiler               0.2.0            py39h107f55c_0    rapidsai-nightly
     pyarrow                   7.0.0           py39h1ed2e5d_7_cuda    conda-forge
     pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
     python                    3.9.13          h9a8a25e_0_cpython    conda-forge
     python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
     python_abi                3.9                      2_cp39    conda-forge
     pytz                      2022.1             pyhd8ed1ab_0    conda-forge
     re2                       2022.04.01           h27087fc_0    conda-forge
     readline                  8.1                  h46c0cb4_0    conda-forge
     rmm                       22.06.00a220531 cuda11_py39_g914cb4c8_75    rapidsai-nightly
     s2n                       1.0.10               h9b69904_0    conda-forge
     setuptools                62.3.2           py39hf3d152e_0    conda-forge
     six                       1.16.0             pyh6c4a22f_0    conda-forge
     snappy                    1.1.9                hbd366e4_1    conda-forge
     spdlog                    1.8.5                h4bd325d_1    conda-forge
     sqlite                    3.38.5               h4ff8645_0    conda-forge
     tk                        8.6.12               h27826a3_0    conda-forge
     typing_extensions         4.2.0              pyha770c72_1    conda-forge
     tzdata                    2022a                h191b570_0    conda-forge
     wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
     xz                        5.2.5                h516909a_1    conda-forge
     zlib                      1.2.12               h166bdaf_0    conda-forge
     zstd                      1.5.2                h8a70e8d_1    conda-forge

Additional context None

anthony-chang commented 2 years ago

I've edited the issue to also include the same problem for \b:

>>> cudf.Series(['_']).str.replace(r'\b', '@', regex=True)
0    _
dtype: object
>>> pd.Series(['_']).str.replace(r'\b', '@', regex=True)
0    @_@
dtype: object
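
The rule behind these results can be written out directly. This is a rough sketch, assuming `\w` is approximated as "alphanumeric or underscore" (the helpers `is_word_char`, `word_boundaries` and `re_boundaries` are hypothetical names, not cuDF or pandas APIs), cross-checked against Python's `re`:

```python
import re

def is_word_char(ch):
    # Assumption: approximate `\w` as alphanumerics plus underscore.
    return ch.isalnum() or ch == '_'

def word_boundaries(text):
    # A position is a word boundary when exactly one of its two neighbours
    # is a word character; the string edges count as non-word.
    positions = []
    for i in range(len(text) + 1):
        before = i > 0 and is_word_char(text[i - 1])
        after = i < len(text) and is_word_char(text[i])
        if before != after:
            positions.append(i)
    return positions

def re_boundaries(text):
    # Positions where Python's regex engine reports a zero-width `\b` match.
    return [m.start() for m in re.finditer(r'\b', text)]

for s in ['_', '_,', '_,a', '_,_', 'ab', '-+']:
    print(repr(s), word_boundaries(s), re_boundaries(s))
```

For `'_'` both helpers report positions 0 and 1, which is why replacing `\b` with `@` is expected to give `@_@`.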

anthony-chang commented 2 years ago

There are also some inconsistencies with the non-word boundary \B, specifically in string split around some non-alphanumeric characters.

>>> cudf.Series([':', '(', ')', ';', ',', '.', '<', '>', '[', ']', '!', '@', '#', '$', '%', '^', '&', '*', '`', '~', '-', '_', '+', '=', '|', '\\', '\'', '"']).str.split(r'\B', regex=True)
0     [, :]
1     [, (]
2     [, )]
3     [, ;]
4     [, ,]
5     [, .]
6     [, <]
7     [, >]
8     [, []
9     [, ]]
10    [, !]
11    [, @]
12    [, #]
13    [, $]
14    [, %]
15    [, ^]
16    [, &]
17    [, *]
18    [, `]
19    [, ~]
20    [, -]
21    [, _]
22    [, +]
23    [, =]
24    [, |]
25    [, \]
26    [, ']
27    [, "]
dtype: list
>>> pd.Series([':', '(', ')', ';', ',', '.', '<', '>', '[', ']', '!', '@', '#', '$', '%', '^', '&', '*', '`', '~', '-', '_', '+', '=', '|', '\\', '\'', '"']).str.split(r'\B', regex=True)
0     [, :, ]
1     [, (, ]
2     [, ), ]
3     [, ;, ]
4     [, ,, ]
5     [, ., ]
6     [, <, ]
7     [, >, ]
8     [, [, ]
9     [, ], ]
10    [, !, ]
11    [, @, ]
12    [, #, ]
13    [, $, ]
14    [, %, ]
15    [, ^, ]
16    [, &, ]
17    [, *, ]
18    [, `, ]
19    [, ~, ]
20    [, -, ]
21        [_]
22    [, +, ]
23    [, =, ]
24    [, |, ]
25    [, \, ]
26    [, ', ]
27    [, ", ]
dtype: object

For the word boundary \b, only _ seems to be problematic:

>>> cudf.Series(['_']).str.split(r'\b', regex=True)
0    [_]
dtype: list
>>> pd.Series(['_']).str.split(r'\b', regex=True)
0    [, _, ]
dtype: object
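
As a CPU-only cross-check, the same splits can be run through `re.split` from the standard library. This is a small sketch, assuming Python 3.7 or newer (earlier versions do not split on zero-width matches); the variable names are illustrative.

```python
import re

# Punctuation is a non-word character, so `\B` matches on both sides of it.
colon_tokens = re.split(r'\B', ':')

# `_` is a word character: no `\B` position exists, but `\b` matches twice.
underscore_non_boundary = re.split(r'\B', '_')
underscore_boundary = re.split(r'\b', '_')

print(colon_tokens)             # ['', ':', '']
print(underscore_non_boundary)  # ['_']
print(underscore_boundary)      # ['', '_', '']
```

These token lists agree with the pandas output above.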

davidwendt commented 2 years ago

Ignoring the _ example, I assume the concern is the number of tokens produced by split? If so, this appears to be a separate issue specific to split and \b and \B:

>>> import pandas as pd
>>> import cudf

>>> cudf.Series(['ab', '-+']).str.split(r'\b', regex=True)
0    [, ab]
1      [-+]
dtype: list
>>> pd.Series(['ab', '-+']).str.split(r'\b', regex=True)
0    [, ab, ]
1        [-+]
dtype: object

>>> cudf.Series(['ab', '-+']).str.split(r'\B', regex=True)
0      [a, b]
1    [, -, +]
dtype: list
>>> pd.Series(['ab', '-+']).str.split(r'\B', regex=True)
0        [a, b]
1    [, -, +, ]
dtype: object
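
The token counts can be reasoned about independently of either library: every zero-width match is one cut point, so n matches yield n + 1 tokens. The sketch below illustrates this with Python's `re` (3.7+ assumed); `expected_token_count` is a hypothetical helper, not a cuDF or pandas function.

```python
import re

def expected_token_count(pattern, text):
    # Each zero-width match is one cut, so n cuts produce n + 1 tokens.
    return len(list(re.finditer(pattern, text))) + 1

cases = [(r'\b', 'ab'), (r'\b', '-+'), (r'\B', 'ab'), (r'\B', '-+')]
counts = []
for pattern, text in cases:
    tokens = re.split(pattern, text)
    counts.append(expected_token_count(pattern, text))
    print(pattern, repr(text), tokens, counts[-1])
```

By this count the four cases should produce 3, 1, 2 and 4 tokens. In the cuDF results quoted above, the cases that differ are each one token short, which would be consistent with a trailing empty string being dropped, though that is only an inference from the output shown here.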

anthony-chang commented 2 years ago

Right, my bad; this isn't limited to just some characters. Should I open a separate issue for this?

davidwendt commented 2 years ago

Yes, I think so. The split fix could be involved and would go into a separate PR at least.

anthony-chang commented 2 years ago

Opened https://github.com/rapidsai/cudf/issues/11102