python / cpython

The Python programming language

https://www.python.org


re: documentation claim that special characters lose their special meaning inside […] seems wrong #106482

Open calestyo opened 1 year ago

calestyo commented 1 year ago

Documentation

The claim at: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L253-L255 seems wrong at least for \.

Consider the following example:

>>> bool(re.search(string=b"a\\b",pattern=b"[\\\n\r]"))
False

My expectation would be that after backslash-unescaping the b"…"-string, pattern is assigned the sequence of:
literal \, the line-feed "character", the carriage-return "character"

If it were true that "Special characters lose their special meaning inside sets.", then the resolved \ in the unescaped pattern should match the one in my test string b"a\\b". However, it does not.

I guess what Python actually "sees" is:
backslash-escaped line-feed "character", the carriage-return "character"
which probably effectively yields:
the line-feed "character", the carriage-return "character"
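
The reading above can be checked directly. This is a minimal sketch, assuming a recent CPython 3 release; the variable names are illustrative only and the `fixed` pattern is an addition that is not part of the original report.

```python
import re

# The bytes literal from the report, after Python's own string unescaping.
pattern = b"[\\\n\r]"
# It holds five bytes: '[', backslash, LF, CR, ']'.
pattern_bytes = list(pattern)

# Inside the set the backslash escapes the LF that follows it, so the set
# contains only LF and CR and cannot match a backslash.
matches_backslash = re.search(pattern, b"a\\b") is not None
matches_linefeed = re.search(pattern, b"a\nb") is not None

# Doubling the backslash at the regex level puts a literal backslash in the set.
fixed = b"[\\\\\n\r]"
fixed_matches_backslash = re.search(fixed, b"a\\b") is not None

print(pattern_bytes, matches_backslash, matches_linefeed, fixed_matches_backslash)
```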

Now you could argue that the \ is not considered a special-character for the terms of the regular expression syntax... but it is, at least already because of: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L504-L507 and ff..

Also, even the section that explains […] mentions the escaping functionality of it: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L249-L250

I think: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L253-L255 should be improved to document that:

\ is exempt from this
whether this is only the case for characters that are actually special with respect to the RE bracket expression, i.e. [0\-9] is 0, - and 9, because the - was special in that position. But what about [\-9]? Here, the - would not have been special, so is the result \, - and 9 or just - and 9?
or whether this is simply the case for any character following the \ ... ones that are special outside an RE bracket expression, like \$, \D, \w or \number ... and/or ones that are never special, like \ü.
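
The open questions in this list can be probed empirically. The sketch below assumes a recent CPython 3 release and only records observed behaviour, not what the documentation guarantees; `set_members` is a hypothetical helper introduced here for illustration.

```python
import re

def set_members(pattern, candidates):
    """Hypothetical helper: return the candidates that a one-character set matches."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["\\", "-", "0", "5", "9", "ü", "$"]

# Hyphen escaped in a position where it would otherwise form a range.
escaped_range = set_members(r"[0\-9]", candidates)
# Hyphen escaped in a position where it would not be special anyway.
leading_hyphen = set_members(r"[\-9]", candidates)
# Escaped character that is never special.
non_ascii = set_members(r"[\ü]", candidates)
# Escaped character that is special only outside a set.
dollar = set_members(r"[\$]", candidates)

print(escaped_range, leading_hyphen, non_ascii, dollar)
```

In none of the four cases does the backslash itself end up as a member of the set.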

Thanks, Chris.

Linked PRs

gh-106517

terryjreedy commented 1 year ago

@serhiy-storchaka You might be the only coredev that can answer this question about \ in re [...] set expressions.

vadmium commented 1 year ago

Agreed the documentation on what is allowed in square-bracket character sets/classes could be made clearer.

There is documentation suggesting to escape a literal closing square bracket \], an initial opening bracket [, and doubled hyphens, ampersands, tildes, and vertical bars (--, &&, ~~, ||). So I conclude that in [\-9] the backslash and hyphen \- represent just a single literal hyphen character.

In the how-to https://docs.python.org/3/howto/regex.html#matching-characters, the six predefined backslash character classes \d \D \s \S \w \W are documented as allowed in square brackets. Also, \b is documented as representing the backspace control character in square brackets.
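
A short sketch of the predefined classes used inside square brackets, assuming a recent CPython 3 release; the sample strings are made up for illustration.

```python
import re

# \d and \s keep their class meaning inside a set.
digits_or_space = re.findall(r"[\d\s]", "a1 b2")
# \W inside a set still means "any non-word character".
non_word = re.findall(r"[\W]", "a-b c")

print(digits_or_space, non_word)
```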

A related limitation is it is not clear if there is any way to have a literal backslash in square brackets.

calestyo commented 1 year ago

A related limitation is it is not clear if there is any way to have a literal backslash in square brackets.

Shouldn't that just be via \\? Or do you mean that it's not yet properly documented?
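
As a quick check of the \\ spelling, here is a minimal sketch assuming a recent CPython 3 release. It shows the raw-string, plain-string and bytes forms, plus what happens when only a single backslash reaches the regex engine.

```python
import re

backslash = "\\"  # a one-character string

# Regex-level \\ inside a set, written three ways.
raw_form = re.fullmatch(r"[\\]", backslash)
plain_form = re.fullmatch("[\\\\]", backslash)
bytes_form = re.fullmatch(rb"[\\]", b"\\")

# With a single regex-level backslash, the ']' is escaped and the set never closes.
try:
    re.compile("[\\]")
    single_backslash_error = None
except re.error as exc:
    single_backslash_error = str(exc)

print(bool(raw_form), bool(plain_form), bool(bytes_form), single_backslash_error)
```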

In any case I'd hope that either \ always has a special meaning (which would include, that it (needlessly) quotes a following character with no special meaning like in \ü) inside bracket expressions - or never.

It would IMO be extremely confusing if the specialness of \ depended on what followed, like in the [\-9] example I gave above.

Other things: https://github.com/python/cpython/blob/d0c6ba956fca28785ad4dea6423cd44fd1124cad/Doc/library/re.rst?plain=1#L504-L507 I would interpret this as follows:

any character that is not an ASCII digit/letter is forever defined to be just that character, i.e. it's guaranteed that \ü or \- will never be special.
conversely, not only those ASCII digits/letters that are already listed may be special; that is, \q may one day become special.
It's not clear what such undefined \ + <ASCII-letter-or-digit> sequences yield in terms of behaviour (or did I just miss that somewhere?). Do they resolve to the literal character? Give an exception?
The initial text at: https://github.com/python/cpython/blob/3e5ce7968f5ab715f649e296e1f6b499621b8091/Doc/library/re.rst?plain=1#L31-L34 obviously means the \-escapes from the strings, not from the REs.
In terms of RE-escapes, \n, \t and friends do not seem to be defined... so r"[\n]", AFAIU, should fall under the previous question: is that a literal n, does it give an exception... or should it also be made special in terms of RE, so that r"[\t]" would be effectively the same as "[\t]" and both match a horizontal tab?
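
The last question can be tested directly. This is a minimal sketch assuming a recent CPython 3 release; it shows observed behaviour only.

```python
import re

# Raw pattern: the regex engine itself sees backslash + 'n'.
raw_newline = re.fullmatch(r"[\n]", "\n")
raw_newline_vs_letter = re.fullmatch(r"[\n]", "n")

# Raw and non-raw spellings of a tab inside a set.
raw_tab = re.fullmatch(r"[\t]", "\t")
plain_tab = re.fullmatch("[\t]", "\t")

print(bool(raw_newline), raw_newline_vs_letter, bool(raw_tab), bool(plain_tab))
```

In this sketch the regex-level \n and \t behave like the string-level escapes, and r"[\n]" does not match a literal n.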

vadmium commented 1 year ago

With Serhiy’s documentation changes, I think backslash escaping would be defined for the hyphen \-, \ü, and standard Python string escapes including \n, \t and the backslash itself \\.

For reserved ASCII letters like \q, the documentation would say they are errors. The code looks like it checks and raises an exception, but I’m not sure it is worth making that documented behaviour.
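
The exception mentioned here can be observed with a small sketch, assuming a recent CPython 3 release. `compile_error` is a hypothetical helper, and the exact message text is an implementation detail.

```python
import re

def compile_error(pattern):
    """Hypothetical helper: return the re.error message, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# An undefined ASCII-letter escape, inside and outside a set.
in_set = compile_error(r"[\q]")
outside_set = compile_error(r"\q")

print(in_set, outside_set, sep="\n")
```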

calestyo commented 1 year ago

Not really sure about that... in his commit he says:

   * Backslash either escapes characters which have special meaning in a set
     such as ``'-'``, ``']'``, ``'^'`` and ``'\\'`` itself or signals
     a special sequence which represents a single character such as
     ``\xa0`` or ``\n`` or a character class such as ``\w`` or ``\S``
     (defined below).

I would interpret this as saying that the \ in [\-9] is not an escaping one, as - would not have a special meaning at that place (nor is it a special sequence like \d).
It gets even weirder, because if it's not escaping, it would be a normal literal \. But then: is this now the set of \, - and 9, or is it the range of characters from \ to 9 (in which case the - would be special again ;-) ), which would however be invalid, as \ is 0x5c and 9 is 0x39?
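
The two readings can be told apart empirically. The sketch below assumes a recent CPython 3 release: it prints the code points involved, then compiles an explicit range from \ to 9 (written with a doubled backslash) next to the pattern under discussion.

```python
import re

backslash_code = hex(ord("\\"))
nine_code = hex(ord("9"))

# An explicit range from backslash down to '9' is rejected as a reversed range.
try:
    re.compile(r"[\\-9]")
    range_error = None
except re.error as exc:
    range_error = str(exc)

# The pattern under discussion compiles and does not contain the backslash.
discussed = re.compile(r"[\-9]")
discussed_matches_backslash = discussed.fullmatch("\\") is not None

print(backslash_code, nine_code, range_error, discussed_matches_backslash)
```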

And maybe I'm missing something, but I think it's still unclear whether \q or \ü are allowed and what they'd yield.
The former is an ASCII letter, but not yet defined, and the patch's:

     Special sequences which do not match a single character such as ``\A``
     and ``\Z`` are not allowed.

means special sequences which do not match a single char (but zero, or - if ever - more than one).
The latter (ü) is not ASCII, but his current wording would rather imply to me that it's either not allowed inside a bracket expression or undefined.
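
For the sequences named in the quoted patch text, a minimal sketch assuming a recent CPython 3 release; the `rejected` dictionary is purely illustrative.

```python
import re

# Try the zero-width sequences from the quoted text inside a set.
rejected = {}
for seq in (r"\A", r"\Z"):
    try:
        re.compile("[" + seq + "]")
    except re.error as exc:
        rejected[seq] = str(exc)

# Outside a set both are accepted as anchors.
anchored = re.fullmatch(r"\Aabc\Z", "abc")

print(sorted(rejected), bool(anchored))
```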

vadmium commented 1 year ago

Would the following bullet point work?

Backslash followed by any character other than an ASCII digit or ASCII letter escapes any special meaning that character may have on its own, such as with '-', ']', '^' and '\\' itself. Backslash followed by an ASCII digit or letter signals a special sequence which represents a single character such as \xa0 or \n or a character class such as \w or \S (defined below). Note that \b represents a single “backspace” character, not a word boundary as outside a set, and numeric escapes such as \1 are always octal escapes, not group references. Special sequences which do not match a single character such as \A and \Z are not allowed.
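
The two notes in this proposed wording (backspace and octal escapes) can be illustrated with a short sketch, assuming a recent CPython 3 release; the sample strings are made up.

```python
import re

# Inside a set, \b is the backspace character (0x08).
backspace_in_set = re.fullmatch(r"[\b]", "\x08")
# Outside a set, \b is a zero-width word boundary.
boundary_outside = re.search(r"\bcat\b", "a cat sat")

# Inside a set, \1 is the octal escape for chr(1), not a reference to group 1.
octal_in_set = re.fullmatch(r"(a)[\1]", "a\x01")
not_a_backreference = re.fullmatch(r"(a)[\1]", "aa")
# Outside a set, \1 refers back to group 1.
backreference_outside = re.fullmatch(r"(a)\1", "aa")

print(bool(backspace_in_set), boundary_outside.group(), bool(octal_in_set),
      not_a_backreference, bool(backreference_outside))
```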

calestyo commented 1 year ago

Hmm. Strictly speaking, IMO it's still unclear insofar as:

Backslash followed by any character other than an ASCII digit or ASCII letter escapes any special meaning that character may have on its own

doesn't definitely tell (the "may" could be read in different ways IMO), what happens, if the following character does not have a special meaning, either because it generally has none (like in the \ü case) or because it does not have one at that position (like in the [\-e] case, where, without the \ the - would not be special).

What about the following:

Within a bracket expression, a backslash escapes the following character.
- If that character has no special meaning at that position (like [\ü] or [\-9], which yield the literal ü respectively the literal - and 9) it results in the character only (the escaping \ is not kept as a literal character on its own).
- If that character has special meaning (like [\\], [\]] or [0\-9], which yield the literal \ respectively the literal ] respectively the literal 0, - and 9) it results in the literal character (with no special meaning) only (the escaping \ is not kept as a literal character on its own).
- Characters with special meaning inside a bracket expression include:
  - The characters specifically special for bracket expressions themselves: \, -, ^ and ]
  - If preceded by an escaping \, any ASCII digit or ASCII letter, if it represents a single character (like \n or \xA0, as well as character classes like \w or \S). Note that \b represents a single “backspace” character, not a word boundary as outside a set, and numeric escapes such as \1 are always octal escapes, not group references. If the character is an ASCII letter or ASCII digit, but the escape sequence does not yet have a defined meaning (like \q) or does not match a single character (like \A or \Z), its use is invalid and ???? raises an exception.

Did I forget anything? ^^

calestyo commented 1 year ago

Maybe one could extend the:

or [\-9], which yield the literal ü respectively the literal - and 9)

even to:

or [\-9], which yield the literal ü respectively the literal - and 9 but not the literal \)

vadmium commented 1 year ago

I hoped writing “any special meaning” would imply if there was no special meaning nothing happens. Would it be clearer to expand to the following?

“Backslash followed by any character other than an ASCII digit or ASCII letter escapes the special meaning that character would have on its own, such as with '-', ']', '^' and '\\' itself, or is ignored if there is no special meaning. Backslash followed by an ASCII digit or letter . . .”

I’m confused in your suggestion when you say for example n in \n has a special meaning, but also claim if the character following the backslash has a special meaning, it results in the literal character. Wouldn’t this mean \n represents a literal n rather than a newline? I think the deciding factor between a literal character vs something else is whether that second character is ASCII alphanumeric, not whether the character has a “special meaning”.
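
The "ASCII alphanumeric" criterion can be written down and compared against the engine. This is only a sketch, assuming a recent CPython 3 release: `escape_kind` is a hypothetical helper encoding the rule as stated in this comment, not anything the `re` module exposes.

```python
import re
import string

def escape_kind(ch):
    """Hypothetical helper: classify backslash + ch by the rule proposed above."""
    if ch.isascii() and ch.isalnum():
        return "special sequence"
    return "literal"

# Every ASCII punctuation character, escaped inside a set, matches itself.
literal_ok = all(
    escape_kind(ch) == "literal" and re.fullmatch("[\\" + ch + "]", ch)
    for ch in string.punctuation
)

# The escaping backslash never becomes a set member (except for \\ itself).
backslash_leaks = [
    ch for ch in string.punctuation
    if ch != "\\" and re.fullmatch("[\\" + ch + "]", "\\")
]

print(literal_ok, backslash_leaks, escape_kind("n"), escape_kind("ü"))
```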

calestyo commented 1 year ago

I hoped writing “any special meaning” would imply if there was no special meaning nothing happens. Would it be clearer to expand to the following?

I know, which is why I wrote »(the "may" could be read in different ways IMO)«.

IMO a reader could interpret this correctly as "the character has a special meaning or it does not have a special meaning".
But there's the case of e.g. a, which alone by itself never has a special meaning, but only in combination with a leading \. Whereas others like - or ^ have or have no special meaning, just depending on their position (regardless of a leading \).
Also, one could interpret “any special meaning” not just as "it has one, or it has none", but also as "one out of a set of special meanings", like e.g. ^ has, which can be the start anchor (outside a bracket expression) or the set-negator (inside a bracket expression).

“Backslash followed by any character other than an ASCII digit or ASCII letter escapes the special meaning that character would have on its own, such as with '-', ']', '^' and '\\' itself, or is ignored if there is no special meaning. Backslash followed by an ASCII digit or letter . . .”

Hmm. I think it goes in a better direction, but that alone would IMO still be ambiguous in cases like e.g. [\-9]: if there were no \, then the - would have no special meaning there. So under the above wording, "escapes the special meaning that character would have on its own" (it would have none here), one could conclude that the \ is not ignored and forms an invalid range here. With other characters, e.g. [\-z], the range could just as easily be a valid one.
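
For the [\-z] case, the two readings give different results, so they can be told apart. This is a minimal sketch assuming a recent CPython 3 release; the probe characters are chosen for illustration.

```python
import re

probes = ("\\", "-", "a", "z")

# The pattern from the comment: escaped hyphen followed by z.
escaped_hyphen = [c for c in probes if re.fullmatch(r"[\-z]", c)]
# A genuine range from backslash (0x5c) to z (0x7a) needs a doubled backslash.
real_range = [c for c in probes if re.fullmatch(r"[\\-z]", c)]

print(escaped_hyphen, real_range)
```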

I’m confused in your suggestion when you say for example n in \n has a special meaning, but also claim if the character following the backslash has a special meaning, it results in the literal character. Wouldn’t this mean \n represents a literal n rather than a newline? I think the deciding factor between a literal character vs something else is whether that second character is ASCII alphanumeric, not whether the character has a “special meaning”.

That's indeed a flaw. One could perhaps write that it becomes the literal character (only) if the character alone would have the special meaning (like the - in [0-9]), whereas it becomes the "special character" (like newline) if the character acquires its special meaning through the preceding \?
Just an idea though.

Ymiros0 commented 1 year ago

To add on to this imo "special characters lose their meaning inside sets" sounds like there are no special characters inside sets whereas actually there even are new special characters that don't have a special meaning outside sets (^ and -) (Which tbf should be quite obvious to the reader of that segment, but might be confusing nonetheless). I don't quite follow this entire discussion about backslashes though, is there any major difference between backslashes outside sets and inside sets I am unaware of? Wouldn't it be simpler to say that they behave the same with the exception of escape sequences that do not define a single character?
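
A short sketch of the behaviour described in this comment, assuming a recent CPython 3 release: metacharacters from outside a set are literal inside it, while ^ and - take on set-specific roles depending on position.

```python
import re

# Characters that are special outside a set are ordinary members inside it.
literal_inside = [c for c in ".*+?$()|" if re.fullmatch(r"[.*+?$()|]", c)]

# '^' negates only as the first character of the set.
negated = re.fullmatch(r"[^a]", "a")
caret_literal = re.fullmatch(r"[a^]", "^")

# '-' forms a range only between two characters.
hyphen_range = re.fullmatch(r"[a-c]", "b")
hyphen_literal = re.fullmatch(r"[a-]", "-")

print(literal_inside, negated, bool(caret_literal), bool(hyphen_range), bool(hyphen_literal))
```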

Btw is it just me or is the spacing between bullet points fluctuating in that section?

vadmium commented 3 weeks ago

Wouldn't it be simpler to say that they behave the same with the exception of escape sequences that do not define a single character?

I think so. Additional exceptions about \b for backspace and octal escapes vs group references are already documented.

The only other thing that comes to mind is a technicality in the wording for non-alphanumeric escapes. An escaped character in a complemented set seems to exclude that character from matches like any other ordinary character, but we currently say “the resulting RE will match the second character”.

>>> re.fullmatch(r'[\@]', '@')  # Matches escaped character
<re.Match object; span=(0, 1), match='@'>
>>> print(re.fullmatch(r'[^\@]', '@'))  # Does not match due to set complement
None

Btw is it just me or is the spacing between bullet points fluctuating in that section?

Yes I think every time there is an index entry, it starts a new bullet list spaced from the previous list. Not sure if it is possible to have an index entry pointing inside a bullet list.