python / cpython

The Python programming language
https://www.python.org
Other
63.5k stars 30.42k forks source link

Proposal: re.prefixmatch method (alias for re.match) #86519

Open gpshead opened 4 years ago

gpshead commented 4 years ago
BPO 42353
Nosy @gpshead, @ezio-melotti, @serhiy-storchaka, @msuozzo
PRs
  • python/cpython#31137
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/gpshead' closed_at = None created_at = labels = ['expert-regex', 'type-feature', 'library', '3.11'] title = 'Proposal: re.prefixmatch method (alias for re.match)' updated_at = user = 'https://github.com/gpshead' ``` bugs.python.org fields: ```python activity = actor = 'serhiy.storchaka' assignee = 'gregory.p.smith' closed = False closed_date = None closer = None components = ['Library (Lib)', 'Regular Expressions'] creation = creator = 'gregory.p.smith' dependencies = [] files = [] hgrepos = [] issue_num = 42353 keywords = ['patch'] message_count = 6.0 messages = ['380928', '380975', '380984', '380996', '412554', '412564'] nosy_count = 5.0 nosy_names = ['gregory.p.smith', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'matthew.suozzo'] pr_nums = ['31137'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue42353' versions = ['Python 3.11'] ```

    gpshead commented 4 years ago

    A well known anti-pattern in Python is use of re.match when you meant to use re.search.

    re.fullmatch was added in 3.4 via https://bugs.python.org/issue16203 for similar reasons.

    re.prefixmatch would be similar: we want the re.match behavior, but want the code to be obvious about its intent. This documents the implicit \A (or ^ when not re.MULTILINE) in the name.

    The goal would be to allow linters to ultimately flag re.match as the anti-pattern when in 3.12+ mode. Asking people to use re.prefixmatch or re.search instead.

    This would help avoid bugs where people mean re.search but write re.match and would ease the cognitive load at code review time by encouraging explicit code without detailed specific Python re domain knowledge on the reviewers part.

    The implementation is trivial.

    This is not a decision to deprecate the widely used in 25 years worth of code's re.match name. That'd be painful and is unlikely to be worth doing without spreading it over a 8+ year timeline.

    serhiy-storchaka commented 4 years ago

    I seen a code which uses re.search() with anchor ^ instead of re.match(), but I never seen a code which uses re.match() instead of re.search(). It just won't work unless you add explicit ".*" or ".*?" at the start of the pattern, and it is a clear indication that re.match() matches the start of the string.

    gpshead commented 4 years ago

    My point is that re.match is a common bug when people really want re.search.

    re.prefixmatch makes it explicit and non-confusing and thus unlikely to be used wrong or misunderstood when read or reviewed.

    The term "match" when talking about regular expressions is not normally meant to imply any anchoring as anchors can be expressed within the regex. Python is relatively unique in bothering to have different methods for a prefix match and an anywhere match. (We'd have been better off without a match method entirely, only having search - too late now)

    1f9907dc-f261-461d-a62c-e5e74baf0104 commented 4 years ago

    It just won't work unless you add explicit ".*" or ".*?" at the start of the pattern

    But think of when regexes are used for validating input. Getting it to "just work" may be over-permissive validation that only actually checks the beginning of the input. They're one missed test case away from a crash or, worse, a security issue.

    This proposed name change would help make the function behavior obvious at the callsite. In the validator example, calling "prefixmatch" would stand out as wrong to even the most oblivious, documentation-averse user.

    My point is that re.match is a common bug when people really want re.search.

    While I think better distinguishing the interfaces is a nice-to-have for usability, I think it has more absolute benefit to correctness. Again, confusion around the semantics of "match" were the motivation for adding "fullmatch" in the first place but that change only went so far to address the problem: It's still too easy to misuse the existing "match" interface and it's not realistic to remove it from the language. A new name would eliminate this class of error at a very low cost.

    gpshead commented 2 years ago

    What do other APIs in widely used languages do with regex terminology? We appear to be the only popular language who anchors to the start of a string with an API named "match".

    libpcre C: uses "match" to mean what we call "search" - https://www.pcre.org/current/doc/html/pcre2_match.html

    Go: Uses "Match" to mean what we call "search" - https://pkg.go.dev/regexp#Match

    JavaScript: Uses "match" to mean what we call "search" - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match

    Java: Uses "matches" (I think meaning what we call fullmatch?) - https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

    C++ RE2: explicit "FullMatch" and "PartialMatch" APIs - https://github.com/google/re2

    Jave re2j: uses "matches" like Java regex.Pattern - https://github.com/google/re2j

    Ruby: Uses "match" as we do "search" - https://ruby-doc.org/core-2.4.0/Regexp.html

    Rust: Uses match as we do "search" - https://docs.rs/regex/latest/regex/

    serhiy-storchaka commented 2 years ago

    I am not convinced. What are examples of using re.match() instead of re.search()? How common is this type of errors?

    There are perhaps many millions of scripts which use re.match(), deprecating re.match() at any time in future would be very destructive, and keeping an alias indefinitely would only add more confusion.

    gpshead commented 2 years ago

    From an explicit is better than implicit standpoint I still think this is a good idea. Even if never actually deprecating and removing re.match

    Just promoting use of the explicit spelling of the API call in documentation and examples should be benificial to the ecosystem as a whole. Code analyzers (pylint, etc) could then suggest prefixmatch as more explicit when it finds an re.match call.

    (I edited the opening message to make it more clear, yay markdown)

    brettcannon commented 2 years ago

    For anyone else who just learned about re.fullmatch(), the docs are at https://docs.python.org/3/library/re.html#re.fullmatch .

    The naming of re.match() is rather unfortunate. I don't know how often the mistake is made of using the wrong function/method, but I'm sure it's more than non-zero. I'm +0 on the alias.

    ezio-melotti commented 2 years ago

    I'm also not convinced :thinking:

    A well known anti-pattern in Python is use of re.match when you meant to use re.search.

    If people are not aware that re.prefixmatch exists and/or of the behavior of re.match, then they will keep using re.match, possibly incorrectly. Many are not aware of re.fullmatch either.

    The goal would be to allow linters to ultimately flag re.match as the anti-pattern when in 3.12+ mode. Asking people to use re.prefixmatch or re.search instead.

    This will only reach people that use such linters, similarly to how a warning in the docs only reaches people that do read the docs (and notice the warning). Both could be applied to mitigate the problem, but neither are definitive solutions, and adding a new function has a cost. So, unless I'm missing something, the addition will only serve to provide a more explicit alternative for the subset of users that use a linter or that are aware of the function and decide to use it.

    However, an alternative already exist: they could just use re.search in most cases[^1], possibly with a regex like ^... (if they want the re.match behavior) or ^...$ (if they want the re.fullmatch behavior). Technically linters could already suggest that (even though it might be a tougher sell than an explicit function).

    This would help avoid bugs where people mean re.search but write re.match and would ease the cognitive load [...]

    It might also increase it though, since now people will have to decide whether to use search, match, fullmatch, prefixmatch and wonder about their differences, and the fact that two have different names but the same behavior is potentially confusing (especially if they end up being used together while updating old code). Some might also wonder why there is no suffixmatch, and possibly propose it.

    [^1]: there are some difference when re.MULTILINE is used: https://docs.python.org/3/library/re.html#search-vs-match

    gpshead commented 2 years ago

    the addition will only serve to provide a more explicit alternative for the subset of users that use a linter or that are aware of the function and decide to use it.

    Yes. Exactly! That is the entire point. To allow those who realize that being explicit reduces the cognitive correctness burden for all developers working in their codebase, huge or small, to do so.

    an alternative already exist: they could just use re.search in most cases, possibly with a regex like ^... (if they want the re.match behavior) or ^...$ (if they want the re.fullmatch behavior).

    ^ is not the logical re equivalent to .match behavior. There's a different backslash letter thing that is "start of string" which virtually nobody knows off the top of their head so would be more bewildering. Again, it's about the ease of being explicit. It is obvious what behavior .prefixmatch implies. It is not at all obvious that .match does that given that one talks about regular expression "matches" in English all the time as a matter of terminology and never means "oh, only at the start of the string" in that discussion context. Python has a mismatch with real terminology by differing from other major regex sporting languages/libraries used across the industry here. This offers a way out.

    Without disruption of removing a very widely used API.

    I'm just seeking to enable being explicit in code instead of requiring a code comment saying # prefixmatch upon every occurrence of .match(...) in code as a preventative measure. Practicality beats purity.

    I don't see an added cognitive burden here from a new explicit method name. Nobody needs lookup prefixmatch, it's behavior is there in the name. If it's existence causes them question reality and lookup match, they'll learn something from the suggestion not to use that API name and perhaps eventually modernize their code.

    I would not have filed this feature request if this were not a repeated source of bugs or confusion from those new to Python at work where people regularly use more than one programming language.

    Our old terminology has been outvoted by JavaScript, Golang, Rust, and libpcre. We should adapt by allowing developers to be readably explicit about their intent.

    serhiy-storchaka commented 2 years ago

    I have yet to see an example of mistakenly using re.match() instead of re.search(). re.search() can be mistakenly used instead of re.match() or re.fullmatch(), I seen more than one examples of it, but not vice versa. I can also expect that re.match() is used instead of re.fullmatch() in a number of cases, mostly in a code that predates re.fullmatch().

    I think there is a problem with an initial premise about "a well known anti-pattern".

    We have a habit of removing long deprecated features in Python. For sure, if something is deprecated, it will be removed in 8 or 12 years, even if it is a simple alias. Forcing people to rewrite tons of working code for stylish reason does not look as a worthy goal to me.

    gpshead commented 2 years ago

    I think our disconnect is merely the difference between samples: "you're is not seeing it" and "my colleagues who routinely do code reviews for hundreds of people sometimes new to Python do see it" are both true statements.

    gpshead commented 2 years ago

    There is a good case to be made that we as a project should provide automated code transformers with every Python release that can be used to migrate source code towards the preferred more modern thing when feasible (and encourage changes to be such that this is feasible). Past examples: 2to3 did this where it could (limited) and ultimately is what lies behind the modernize tool which kept that torch lit and is more useful.

    But I wouldn't want to block this mere preferred more explicit name alias going in until such tooling to handle a decade later potential deprecation of the old name being available. They're separate requests.