Open calestyo opened 1 year ago
@serhiy-storchaka You might be the only coredev that can answer this question about \ in re [...] set expressions.
Agreed the documentation on what is allowed in square-bracket character sets/classes could be made clearer.
There is documentation suggesting to escape a literal closing square bracket \], an initial opening bracket [, and doubled hyphens, ampersands, tildes, and vertical bars (--, &&, ~~, ||). So I conclude that in [\-9] the backslash and hyphen \- represent just a single literal hyphen character.
In the how-to https://docs.python.org/3/howto/regex.html#matching-characters, the six predefined backslash character classes \d \D \s \S \w \W are documented as allowed in square brackets. Also, \b is documented as representing the backspace control character in square brackets.
A related limitation is it is not clear if there is any way to have a literal backslash in square brackets.
A related limitation is it is not clear if there is any way to have a literal backslash in square brackets.
Shouldn't that just be via \\? Or do you mean that it's not yet properly documented?
In any case I'd hope that either \ always has a special meaning (which would include that it (needlessly) quotes a following character with no special meaning, like in \ü) inside bracket expressions - or never.
It would IMO be extremely confusing if the specialness of \ depended on what followed, like in the [\-9] I gave above.
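For reference, here is a quick check of what the current CPython `re` module does with `[\-9]`. This only shows observed behaviour; it says nothing about what the documentation guarantees.

```python
import re

# r'[\-9]' is the five-character pattern  [ \ - 9 ]
pattern = re.compile(r'[\-9]')

# Probe a few single characters and keep the ones the set accepts.
candidates = ['-', '9', '\\', '5']
accepted = [c for c in candidates if pattern.fullmatch(c)]
print(accepted)  # ['-', '9']
```

So the backslash is consumed as an escape and does not become a member of the set.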
Other things:
https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L504-L507
I would interpret this as follows:
any character that is not an ASCII digit/letter is forever defined to be just that character, i.e. it's guaranteed that \ü or \- will never be special.
conversely, not only those ASCII digits/letters that are already listed may be special; others, like \q, may one day become special.
It's not clear what such non-defined \ + <ASCII-letter-or-digit> sequences yield in terms of behaviour (or did I just miss that somewhere?). Do they resolve to the literal character? Give an exception?
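A small experiment answers the behavioural part of this for current CPython (3.6 and later, as far as I know). It shows what the implementation does, not what the documentation promises.

```python
import re

# Backslash + non-ASCII character: compiles, and matches the character itself.
u_umlaut = re.fullmatch(r'[\ü]', 'ü')

# Backslash + ASCII letter without a defined meaning: rejected at compile time.
try:
    re.compile(r'[\q]')
    q_result = 'compiled'
except re.error as exc:
    q_result = f'error: {exc}'

print(u_umlaut is not None)  # True
print(q_result)              # an "error: ..." string describing a bad escape
```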
The initial text at:
https://github.com/python/cpython/blob/3e5ce7968f5ab715f649e296e1f6b499621b8091/Doc/library/re.rst?plain=1#L31-L34
obviously means the \-escapes from the strings, not from the REs.
In terms of RE-escapes, \n, \t and friends do not seem to be defined... so r"[\n]", AFAIU, should fall under the previous question: is that a literal n, does it give an exception... or should it also be made special in terms of RE, so that r"[\t]" would be effectively the same as "[\t]" and both match a horizontal tab?
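As a data point, this is how the two spellings behave in current CPython. In the raw string the `re` parser sees backslash + `t`; in the normal string it sees an actual tab character.

```python
import re

raw_set = re.compile(r'[\t]')    # pattern text: [ \ t ]  (escape handled by re)
cooked_set = re.compile('[\t]')  # pattern text: [ TAB ]  (escape handled by the string literal)

# Both spellings accept a tab and reject the letter t.
tab_results = (bool(raw_set.fullmatch('\t')), bool(cooked_set.fullmatch('\t')))
letter_results = (bool(raw_set.fullmatch('t')), bool(cooked_set.fullmatch('t')))

# The same holds for \n: inside a set it is a newline, not a literal n.
newline_in_raw = bool(re.fullmatch(r'[\n]', '\n'))

print(tab_results, letter_results, newline_in_raw)  # (True, True) (False, False) True
```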
With Serhiy’s documentation changes, I think backslash escaping would be defined for the hyphen \-, \ü, and standard Python string escapes including \n, \t and the backslash itself \\.
For reserved ASCII letters like \q, the documentation would say they are errors. The code looks like it checks and raises an exception, but I’m not sure it is worth making that documented behaviour.
Not really sure about that... in his commit he says:
* Backslash either escapes characters which have special meaning in a set
such as ``'-'``, ``']'``, ``'^'`` and ``'\\'`` itself or signals
a special sequence which represents a single character such as
``\xa0`` or ``\n`` or a character class such as ``\w`` or ``\S``
(defined below).
I would interpret this as the \ in [\-9] not being an escaping one, as - would not have a special meaning at that place (neither is it a special sequence like \d).
It gets even more weird, because if it's not escaping, it would be a normal literal \. But then: is this now the set of \, - and 9 - or is it the range of characters from \ to 9 (in which case the - would be special again ;-) ), which would however be invalid, as \ is 0x5c and 9 is 0x39.
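To make the "range" reading concrete: an unescaped backslash cannot stand alone in a pattern, so a range starting at a literal backslash has to be written with a doubled backslash. A sketch of what current CPython does with that spelling:

```python
import re

print(hex(ord('\\')), hex(ord('-')), hex(ord('9')))  # 0x5c 0x2d 0x39

# Range from \ (0x5c) down to 9 (0x39): the end point is below the start.
try:
    re.compile(r'[\\-9]')
    reversed_range = 'compiled'
except re.error:
    reversed_range = 'error'

# Same shape with a higher end point: \ (0x5c) through z (0x7a) is a valid range.
valid_range = re.compile(r'[\\-z]')
in_range = [c for c in ['\\', 'a', 'z', '-', '9'] if valid_range.fullmatch(c)]

print(reversed_range)  # error
print(in_range)        # ['\\', 'a', 'z']
```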
And maybe I miss something, but I think it's still unclear whether \q or \ü are allowed and what they'd yield.
The former is an ASCII letter, but not yet defined, and the patch's:
Special sequences which do not match a single character such as ``\A``
and ``\Z`` are not allowed.
means special sequences which do not match a single char (but zero, or - if ever - more than one).
The latter (ü) is not ASCII, but his current wording would rather imply to me that it's either not allowed inside a bracket expression or undefined.
Would the following bullet point work?
Backslash followed by any character other than an ASCII digit or ASCII letter escapes any special meaning that character may have on its own, such as with '-', ']', '^' and '\\' itself. Backslash followed by an ASCII digit or letter signals a special sequence which represents a single character such as \xa0 or \n or a character class such as \w or \S (defined below). Note that \b represents a single “backspace” character, not a word boundary as outside a set, and numeric escapes such as \1 are always octal escapes, not group references. Special sequences which do not match a single character such as \A and \Z are not allowed.
Hmm. Strictly speaking it's IMO still unclear insofar as:
Backslash followed by any character other than an ASCII digit or ASCII letter escapes any special meaning that character may have on its own
doesn't definitely tell (the "may" could be read in different ways IMO) what happens if the following character does not have a special meaning, either because it generally has none (like in the \ü case) or because it does not have one at that position (like in the [\-e] case, where, without the \, the - would not be special).
What about the following:
… [\ü] or [\-9], which yield the literal ü respectively the literal - and 9) it results in the character only (the escaping \ is not kept as a literal character on its own).
… [\\], [\]] or [0\-9], which yield the literal \ respectively the literal ] respectively the literal 0, - and 9) it results in the literal character (with no special meaning) only (the escaping \ is not kept as a literal character on its own).
… \, -, ^ and ]
… \, any ASCII digit or ASCII letter, if it represents a single character (like \n or \xA0, as well as character classes like \w or \S). Note that \b represents a single “backspace” character, not a word boundary as outside a set, and numeric escapes such as \1 are always octal escapes, not group references. If the character is an ASCII letter or ASCII digit, but the escape sequence does not yet have a defined meaning (like \q) or does not match a single character (like \A or \Z), its use is invalid and ???? raises an exception.
Did I forget anything? ^^
Maybe one could extend the:
… or [\-9], which yield the literal ü respectively the literal - and 9)
even to:
… or [\-9], which yield the literal ü respectively the literal - and 9 but not the literal \)
I hoped writing “any special meaning” would imply if there was no special meaning nothing happens. Would it be clearer to expand to the following?
“Backslash followed by any character other than an ASCII digit or ASCII letter escapes the special meaning that character would have on its own, such as with '-', ']', '^' and '\\' itself, or is ignored if there is no special meaning. Backslash followed by an ASCII digit or letter . . .”
I’m confused in your suggestion when you say for example n in \n has a special meaning, but also claim if the character following the backslash has a special meaning, it results in the literal character. Wouldn’t this mean \n represents a literal n rather than a newline? I think the deciding factor between a literal character vs something else is whether that second character is ASCII alphanumeric, not whether the character has a “special meaning”.
I hoped writing “any special meaning” would imply if there was no special meaning nothing happens. Would it be clearer to expand to the following?
I know, which is why I wrote »(the "may" could be read in different ways IMO)«.
IMO a reader could interpret this correctly as "the character has a special meaning or it does not have a special meaning".
But there's the case of e.g. a, which alone by itself never has a special meaning, but only in combination with a leading \. Whereas others like - or ^ have or have no special meaning just depending on their position (regardless of a leading \).
Also, one could interpret “any special meaning” not just as "it has one, or it has none", but also as "one out of a set of special meanings", like e.g. ^ has, which can be the start anchor (outside a bracket expression) or the set-negator (inside a bracket expression).
“Backslash followed by any character other than an ASCII digit or ASCII letter escapes the special meaning that character would have on its own, such as with '-', ']', '^' and '\' itself, or is ignored if there is no special meaning. Backslash followed by an ASCII digit or letter . . .”
Hmm. I think it goes in a better direction, but that alone would IMO still be ambiguous in cases like e.g. [\-9], because if there was no \, then the - would have no special meaning. Thus, under the above wording "escapes the special meaning that character would have on its own" (it would have none here), it would mean that the \ is not ignored, forming an invalid range here. The range could, however, just as easily be a valid one, assuming e.g. [\-z].
I’m confused in your suggestion when you say for example n in \n has a special meaning, but also claim if the character following the backslash has a special meaning, it results in the literal character. Wouldn’t this mean \n represents a literal n rather than a newline? I think the deciding factor between a literal character vs something else is whether that second character is ASCII alphanumeric, not whether the character has a “special meaning”.
That's indeed a flaw. One could perhaps write that it becomes the literal character (only) if the character alone would have the special meaning (like the - in [0-9]) - whereas, in contrast, it becomes the "special character" (like newline) if the character gets its special meaning through the preceding \?
Just an idea though.
To add on to this, imo "special characters lose their meaning inside sets" sounds like there are no special characters inside sets, whereas actually there even are special characters with a meaning they don't have outside sets (^ and -). (Which tbf should be quite obvious to the reader of that segment, but might be confusing nonetheless.)
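A short illustration of both halves of that point, using a hypothetical helper `members` (not part of `re`) that reports which probe characters a set accepts:

```python
import re

def members(set_pattern, candidates='^-.$*ab'):
    """Return the candidate characters that the given set pattern accepts."""
    compiled = re.compile(set_pattern)
    return ''.join(c for c in candidates if compiled.fullmatch(c))

print(members(r'[.$*]'))  # '.$*'     metacharacters from outside a set are plain members
print(members(r'[^a]'))   # '^-.$*b'  a leading ^ complements the set
print(members(r'[a^]'))   # '^a'      ^ anywhere else is a literal
print(members(r'[a-b]'))  # 'ab'      - between two characters forms a range
print(members(r'[ab-]'))  # '-ab'     - at the end is a literal
```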
I don't quite follow this entire discussion about backslashes though, is there any major difference between backslashes outside sets and inside sets I am unaware of? Wouldn't it be simpler to say that they behave the same with the exception of escape sequences that do not define a single character?
Btw is it just me or is the spacing between bullet points fluctuating in that section?
Wouldn't it be simpler to say that they behave the same with the exception of escape sequences that do not define a single character?
I think so. Additional exceptions about \b for backspace and octal escapes vs group references are already documented.
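A minimal sketch of those exceptions as current CPython implements them; `compiles` is a hypothetical helper defined here only for the demonstration.

```python
import re

backspace_in_set = bool(re.fullmatch(r'[\b]', '\x08'))  # \b inside a set: backspace
octal_in_set = bool(re.fullmatch(r'[\1]', '\x01'))      # \1 inside a set: octal escape
class_in_set = bool(re.fullmatch(r'[\d]', '7'))         # \d behaves as it does outside

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Zero-width escapes are fine outside a set but rejected inside one.
anchors_outside = (compiles(r'\A'), compiles(r'\Z'))
anchors_in_set = (compiles(r'[\A]'), compiles(r'[\Z]'))

print(backspace_in_set, octal_in_set, class_in_set)  # True True True
print(anchors_outside, anchors_in_set)               # (True, True) (False, False)
```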
The only other thing that comes to mind is a technicality in the wording for non-alphanumeric escapes. An escaped character in a complemented set seems to exclude that character from matches like any other ordinary character, but we currently say “the resulting RE will match the second character”.
>>> re.fullmatch(r'[\@]', '@') # Matches escaped character
<re.Match object; span=(0, 1), match='@'>
>>> print(re.fullmatch(r'[^\@]', '@')) # Does not match due to set complement
None
Btw is it just me or is the spacing between bullet points fluctuating in that section?
Yes I think every time there is an index entry, it starts a new bullet list spaced from the previous list. Not sure if it is possible to have an index entry pointing inside a bullet list.
Documentation
The claim at: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L253-L255 seems wrong at least for \.
Consider the following example:
My expectation would be that after backslash-unescaping the b"…"-string, pattern is assigned the sequence of:
literal \, the line-feed "character", the carriage-return "character"
If it were true that "Special characters lose their special meaning inside sets.", then the resolved \ in the unescaped pattern should match the one in my test string b"a\\b", however it does not.
I guess what Python actually "sees" is:
backslash-escaped line-feed "character", the carriage-return "character"
which probably effectively yields:
the line-feed "character", the carriage-return "character"
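The example snippet itself is not preserved here. The following is an assumed reconstruction, consistent with the description above (a bytes pattern whose set contains a backslash, a line feed and a carriage return, tested against b"a\\b"); the exact original pattern may have differed.

```python
import re

# Assumed reconstruction, not the original snippet.
pattern = b"[\\\n\r]"      # bytes after literal unescaping: [ \ LF CR ]
print(list(pattern))       # [91, 92, 10, 13, 93]

test = b"a\\b"             # bytes: a \ b
backslash_found = re.search(pattern, test)

# The set still accepts LF and CR, so the backslash was consumed as an escape.
lf_found = re.search(pattern, b"x\ny")
cr_found = re.search(pattern, b"x\ry")

print(backslash_found)                               # None
print(lf_found is not None, cr_found is not None)    # True True
```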
Now you could argue that the \ is not considered a special character for the terms of the regular expression syntax... but it is, at least already because of: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L504-L507 and ff.
Also, even the section that explains […] mentions the escaping functionality of it: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L249-L250
I think: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L253-L255 should be improved to document that:
\ is exempt from this
… [0\-9] is 0, - and 9, because the - was special in that position. But what about [\-9]? Here, the - would not have been special, so is the result \, - and 9 or just - and 9?
… \ ... ones that are special outside an RE bracket expression, like \$, \D, \w or \number ... and/or ones that are never special, like \ü.
Thanks, Chris.