Open gpshead opened 4 years ago
A well known anti-pattern in Python is use of re.match when you meant to use re.search.
re.fullmatch
was added in 3.4 via https://bugs.python.org/issue16203 for similar reasons.
re.prefixmatch
would be similar: we want the re.match behavior, but want the code to be obvious about its intent. This documents the implicit \A
(or ^
when not re.MULTILINE
) in the name.
The goal would be to allow linters to ultimately flag re.match
as the anti-pattern when in 3.12+ mode. Asking people to use re.prefixmatch
or re.search
instead.
This would help avoid bugs where people mean re.search
but write re.match
and would ease the cognitive load at code review time by encouraging explicit code without detailed specific Python re
domain knowledge on the reviewers part.
The implementation is trivial.
This is not a decision to deprecate the widely used in 25 years worth of code's re.match name. That'd be painful and is unlikely to be worth doing without spreading it over a 8+ year timeline.
I seen a code which uses re.search() with anchor ^ instead of re.match(), but I never seen a code which uses re.match() instead of re.search(). It just won't work unless you add explicit ".*" or ".*?" at the start of the pattern, and it is a clear indication that re.match() matches the start of the string.
My point is that re.match is a common bug when people really want re.search.
re.prefixmatch makes it explicit and non-confusing and thus unlikely to be used wrong or misunderstood when read or reviewed.
The term "match" when talking about regular expressions is not normally meant to imply any anchoring as anchors can be expressed within the regex. Python is relatively unique in bothering to have different methods for a prefix match and an anywhere match. (We'd have been better off without a match method entirely, only having search - too late now)
It just won't work unless you add explicit ".*" or ".*?" at the start of the pattern
But think of when regexes are used for validating input. Getting it to "just work" may be over-permissive validation that only actually checks the beginning of the input. They're one missed test case away from a crash or, worse, a security issue.
This proposed name change would help make the function behavior obvious at the callsite. In the validator example, calling "prefixmatch" would stand out as wrong to even the most oblivious, documentation-averse user.
My point is that re.match is a common bug when people really want re.search.
While I think better distinguishing the interfaces is a nice-to-have for usability, I think it has more absolute benefit to correctness. Again, confusion around the semantics of "match" were the motivation for adding "fullmatch" in the first place but that change only went so far to address the problem: It's still too easy to misuse the existing "match" interface and it's not realistic to remove it from the language. A new name would eliminate this class of error at a very low cost.
What do other APIs in widely used languages do with regex terminology? We appear to be the only popular language who anchors to the start of a string with an API named "match".
libpcre C: uses "match" to mean what we call "search" - https://www.pcre.org/current/doc/html/pcre2_match.html
Go: Uses "Match" to mean what we call "search" - https://pkg.go.dev/regexp#Match
JavaScript: Uses "match" to mean what we call "search" - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match
Java: Uses "matches" (I think meaning what we call fullmatch?) - https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
C++ RE2: explicit "FullMatch" and "PartialMatch" APIs - https://github.com/google/re2
Jave re2j: uses "matches" like Java regex.Pattern - https://github.com/google/re2j
Ruby: Uses "match" as we do "search" - https://ruby-doc.org/core-2.4.0/Regexp.html
Rust: Uses match as we do "search" - https://docs.rs/regex/latest/regex/
I am not convinced. What are examples of using re.match() instead of re.search()? How common is this type of errors?
There are perhaps many millions of scripts which use re.match(), deprecating re.match() at any time in future would be very destructive, and keeping an alias indefinitely would only add more confusion.
From an explicit is better than implicit standpoint I still think this is a good idea. Even if never actually deprecating and removing re.match
Just promoting use of the explicit spelling of the API call in documentation and examples should be benificial to the ecosystem as a whole. Code analyzers (pylint, etc) could then suggest prefixmatch
as more explicit when it finds an re.match
call.
(I edited the opening message to make it more clear, yay markdown)
For anyone else who just learned about re.fullmatch()
, the docs are at https://docs.python.org/3/library/re.html#re.fullmatch .
The naming of re.match()
is rather unfortunate. I don't know how often the mistake is made of using the wrong function/method, but I'm sure it's more than non-zero. I'm +0 on the alias.
I'm also not convinced :thinking:
A well known anti-pattern in Python is use of
re.match
when you meant to usere.search
.
If people are not aware that re.prefixmatch
exists and/or of the behavior of re.match
, then they will keep using re.match
, possibly incorrectly. Many are not aware of re.fullmatch
either.
The goal would be to allow linters to ultimately flag
re.match
as the anti-pattern when in 3.12+ mode. Asking people to usere.prefixmatch
orre.search
instead.
This will only reach people that use such linters, similarly to how a warning in the docs only reaches people that do read the docs (and notice the warning). Both could be applied to mitigate the problem, but neither are definitive solutions, and adding a new function has a cost. So, unless I'm missing something, the addition will only serve to provide a more explicit alternative for the subset of users that use a linter or that are aware of the function and decide to use it.
However, an alternative already exist: they could just use re.search
in most cases[^1], possibly with a regex like ^...
(if they want the re.match
behavior) or ^...$
(if they want the re.fullmatch
behavior). Technically linters could already suggest that (even though it might be a tougher sell than an explicit function).
This would help avoid bugs where people mean re.search but write re.match and would ease the cognitive load [...]
It might also increase it though, since now people will have to decide whether to use search
, match
, fullmatch
, prefixmatch
and wonder about their differences, and the fact that two have different names but the same behavior is potentially confusing (especially if they end up being used together while updating old code). Some might also wonder why there is no suffixmatch
, and possibly propose it.
[^1]: there are some difference when re.MULTILINE
is used: https://docs.python.org/3/library/re.html#search-vs-match
the addition will only serve to provide a more explicit alternative for the subset of users that use a linter or that are aware of the function and decide to use it.
Yes. Exactly! That is the entire point. To allow those who realize that being explicit reduces the cognitive correctness burden for all developers working in their codebase, huge or small, to do so.
an alternative already exist: they could just use re.search in most cases, possibly with a regex like ^... (if they want the re.match behavior) or ^...$ (if they want the re.fullmatch behavior).
^
is not the logical re equivalent to .match
behavior. There's a different backslash letter thing that is "start of string" which virtually nobody knows off the top of their head so would be more bewildering. Again, it's about the ease of being explicit. It is obvious what behavior .prefixmatch
implies. It is not at all obvious that .match
does that given that one talks about regular expression "matches" in English all the time as a matter of terminology and never means "oh, only at the start of the string" in that discussion context. Python has a mismatch with real terminology by differing from other major regex sporting languages/libraries used across the industry here. This offers a way out.
Without disruption of removing a very widely used API.
I'm just seeking to enable being explicit in code instead of requiring a code comment saying # prefixmatch
upon every occurrence of .match(...)
in code as a preventative measure. Practicality beats purity.
I don't see an added cognitive burden here from a new explicit method name. Nobody needs lookup prefixmatch
, it's behavior is there in the name. If it's existence causes them question reality and lookup match
, they'll learn something from the suggestion not to use that API name and perhaps eventually modernize their code.
I would not have filed this feature request if this were not a repeated source of bugs or confusion from those new to Python at work where people regularly use more than one programming language.
Our old terminology has been outvoted by JavaScript, Golang, Rust, and libpcre. We should adapt by allowing developers to be readably explicit about their intent.
I have yet to see an example of mistakenly using re.match()
instead of re.search()
. re.search()
can be mistakenly used instead of re.match()
or re.fullmatch()
, I seen more than one examples of it, but not vice versa. I can also expect that re.match()
is used instead of re.fullmatch()
in a number of cases, mostly in a code that predates re.fullmatch()
.
I think there is a problem with an initial premise about "a well known anti-pattern".
We have a habit of removing long deprecated features in Python. For sure, if something is deprecated, it will be removed in 8 or 12 years, even if it is a simple alias. Forcing people to rewrite tons of working code for stylish reason does not look as a worthy goal to me.
I think our disconnect is merely the difference between samples: "you're is not seeing it" and "my colleagues who routinely do code reviews for hundreds of people sometimes new to Python do see it" are both true statements.
There is a good case to be made that we as a project should provide automated code transformers with every Python release that can be used to migrate source code towards the preferred more modern thing when feasible (and encourage changes to be such that this is feasible). Past examples: 2to3
did this where it could (limited) and ultimately is what lies behind the modernize
tool which kept that torch lit and is more useful.
But I wouldn't want to block this mere preferred more explicit name alias going in until such tooling to handle a decade later potential deprecation of the old name being available. They're separate requests.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/gpshead' closed_at = None created_at =
labels = ['expert-regex', 'type-feature', 'library', '3.11']
title = 'Proposal: re.prefixmatch method (alias for re.match)'
updated_at =
user = 'https://github.com/gpshead'
```
bugs.python.org fields:
```python
activity =
actor = 'serhiy.storchaka'
assignee = 'gregory.p.smith'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Regular Expressions']
creation =
creator = 'gregory.p.smith'
dependencies = []
files = []
hgrepos = []
issue_num = 42353
keywords = ['patch']
message_count = 6.0
messages = ['380928', '380975', '380984', '380996', '412554', '412564']
nosy_count = 5.0
nosy_names = ['gregory.p.smith', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'matthew.suozzo']
pr_nums = ['31137']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue42353'
versions = ['Python 3.11']
```