pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.62k stars 17.91k forks source link

BUG: "str.contains" match groups UserWarning discourages best practices & slows performance #56798

Open bionicles opened 9 months ago

bionicles commented 9 months ago

Pandas version checks

Reproducible Example

test_regex_groups_warning.py

import pandas as pd

fruits = pd.Series(["apple", "banana", "apple and banana"])

def test_user_warning_in_series():
    print("fruits:\n", fruits)
    print("UNNAMED REGEX")  # less accessible
    nand_query = r"^(?!(?=.*apple)(?=.*banana)).*$"
    print(f"'{nand_query}'")
    filtered = fruits[fruits.str.contains(nand_query)]
    print("filtered:\n", filtered)
    nand_condition = ~(
        filtered.str.contains("head", case=False)
        & filtered.str.contains("neck", case=False)
    )
    print("nand_condition:\n", nand_condition)
    if not nand_condition.all():
        failures = filtered[~nand_condition]
        print("failures:\n", failures)
    assert nand_condition.all()
    assert 0

def test_user_warning_with_names():
    print("fruits:\n", fruits)
    print("NAMED REGEX")  # more clear about what is matched and why:
    named_nand_query = r"(?P<nand>^(?P<not_>(?!(?P<and_>(?=.*apple)(?=.*banana)))).*$)"
    print(f"'{named_nand_query}'")
    filtered_named = fruits[fruits.str.contains(named_nand_query)]
    print("filtered_named:\n", filtered_named)
    named_nand_condition = ~(
        filtered_named.str.contains("head", case=False)
        & filtered_named.str.contains("neck", case=False)
    )
    print("named_nand_condition:\n", named_nand_condition)
    if not named_nand_condition.all():
        named_failures = filtered_named[~named_nand_condition]
        print("named_failures:\n", named_failures)
    assert named_nand_condition.all()
    assert 0

Issue Description

image

@phofl wrote re: PR 56763 to remove this warning

Please open issues to facilitate a discussion before you open a PR

Sorry, my bad, here is an issue to discuss the warning I proposed to remove.

PRE-TL;DR:

result of running reproducible example:

image

This warning is annoying and serves to discourage use of named capturing groups (which are more maintainable) because users must either switch to extracting the groups (not always necessary) or replace their named groups with "(?:" (noncapturing groups are harder to maintain because it's less clear what is their objective within a greater regex pattern).

If users need to specialize their regex patterns to each command, then they need to maintain multiple copies, some with non-captured groups, some without, just to silence some warning, also, if they remove the groups, then later on when they want to use them, they might have to figure out how to replace groups they removed just to silence a warning, and be frustrated.

The logical condition for the warning also compiles the pattern but then the compiled pattern is discarded, so this warning slows down every "contains" in pandas just to check if we should annoy people who probably know what they're doing.

If we remove this unnecessary warning, then we no longer discourage users who use named capturing groups, thus facilitating readability of the patterns, and portability to other contexts, such as debuggers or the "extract" method mentioned in the removed warning.

TL;DR: This warning doesn't need to exist, discourages best practices, and slows performance of every string contains query in pandas (a super "hot path!"), so I suggest we remove it.

image

here is a permalink to the line of code to check for the warning condition

        dtype: bool
        """
-        if regex and re.compile(pat).groups: # compiled and discarded the pattern
-            warnings.warn(
-                "This pattern is interpreted as a regular expression, and has "
-                "match groups. To actually get the groups, use str.extract.",
-                UserWarning,
-                stacklevel=find_stack_level(),
-            )

        result = self._data.array._str_contains(pat, case, flags, na, regex)
        return self._wrap_result(result, fill_value=na, returns_string=False)

just compile the pattern and use the compiled pattern:

        dtype: bool
        """
        result = self._data.array._str_contains(pat, case, flags, na, regex)
        return self._wrap_result(result, fill_value=na, returns_string=False)
``
optionally (may cause issues depending on _str_contains logic:
```diff
- result = self._data.array._str_contains(pat, case, flags, na, regex)
+ compiled_pattern = re.compile(pat, flags)
+ result = self._data.array._str_contains(compiled_pattern, case, flags, na, regex)

Expected Behavior

no warning for containment queries using groups

Installed Versions

> python Python 3.10.9 (main, Mar 1 2023, 18:23:06) [GCC 11.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pandas as pd >>> pd.show_versions() /home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.") Traceback (most recent call last): File "", line 1, in File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 141, in show_versions deps = _get_dependency_info() File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 98, in _get_dependency_info mod = import_optional_dependency(modname, errors="ignore") File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/compat/_optional.py", line 132, in import_optional_dependency module = importlib.import_module(name) File "/home/bion/miniconda3/envs/py310/lib/python3.10/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1050, in _gcd_import File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/__init__.py", line 42, in from numba.np.ufunc import (vectorize, guvectorize, threading_layer, File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/__init__.py", line 3, in from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/decorators.py", line 3, in from numba.np.ufunc import _internal SystemError: initialization of _internal failed without raising an exception
> pip show pandas
Name: pandas
Version: 2.1.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author:
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License

        Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
        All rights reserved.

        Copyright (c) 2011-2023, Open source contributors.

        Redistribution and use in source and binary forms, with or without
        modification, are permitted provided that the following conditions are met:

        * Redistributions of source code must retain the above copyright notice, this
          list of conditions and the following disclaimer.

        * Redistributions in binary form must reproduce the above copyright notice,
          this list of conditions and the following disclaimer in the documentation
          and/or other materials provided with the distribution.

        * Neither the name of the copyright holder nor the names of its
          contributors may be used to endorse or promote products derived from
          this software without specific prior written permission.

        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
        DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
        FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
        SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
        CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
        OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
        OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Location: /home/bion/miniconda3/envs/py310/lib/python3.10/site-packages
Requires: numpy, python-dateutil, pytz, tzdata
bionicles commented 9 months ago

follow up: for easy wins in pandas performance, a review of all the logical conditions for warning in pandas might be warranted, if we're compiling stuff just to check if we ought to warn people, then we could remove these warnings where possible to speed everything up and simplify the codebase

rhshadrach commented 9 months ago

Thanks for the report. There can be a performance hit on the actual operation when using capturing vs non-capturing groups however. I think this might be another good case for #55385.