python / cpython

The Python programming language
https://www.python.org/

re docs should state exactly which whitespace is matched by \s #118508

Open jeremyredhead opened 2 months ago

jeremyredhead commented 2 months ago

Documentation

Currently, and apparently since 3.0, the re documentation simply states that \s "Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages)."

But "Unicode whitespace characters" seems awfully vague. Exactly which General_Category value does that correspond to? Space_Separator (Zs)? Separator (Z)? Some Python-specific selection of "Unicode whitespace characters"? It's not entirely clear.

The 2.7 docs were better, stating that "If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database."
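For comparison, here is a minimal sketch of how the two modes differ in Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restricts \s to [ \t\n\r\f\v]. The sample characters and variable names are illustrative choices, not taken from the docs.

```python
import re

# Probe characters: the six ASCII whitespace characters plus two
# non-ASCII candidates (NO-BREAK SPACE and EM SPACE), chosen for illustration.
samples = [" ", "\t", "\n", "\r", "\f", "\v", "\xa0", "\u2003"]

unicode_ws = re.compile(r"\s")          # default behaviour for str patterns
ascii_ws = re.compile(r"\s", re.ASCII)  # \s limited to [ \t\n\r\f\v]

for ch in samples:
    # Columns: character, matched in Unicode mode, matched in ASCII mode
    print(repr(ch), bool(unicode_ws.fullmatch(ch)), bool(ascii_ws.fullmatch(ch)))
```

This only shows that the two modes disagree on non-ASCII characters; it does not by itself say which Unicode property drives the default mode.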

I'd like to believe that Python 3.x uses the exact same definition for \s as 2.7 did, and that therefore I already have the answer to my question. I'd like to believe a lot of things. But computers don't run on belief(s).

No one should have to resort to digging through the source code for the answer to such a simple but important question.

P.S. I did try searching the interwebs for an answer to "which whitespace is matched by \s in python". Unfortunately search engines seem entirely unwilling to help. Perhaps no one else knows or wants to know. That's a shame.


nineteendo commented 2 months ago

> Exactly which General_Category value does that correspond to? Space_Separator (Zs)? Separator (Z)?

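All three separator categories: Zl, Zp and Zs. In addition, a handful of control characters (category Cc) count as whitespace, as the output below shows.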

> which whitespace is matched by \s in python

>>> import sys, unicodedata
>>> for i in range(sys.maxunicode + 1):  # +1 so U+10FFFF is included
...     char = chr(i)
...     if char.isspace():
...         print(i, repr(char), unicodedata.category(char))
... 
9 '\t' Cc
10 '\n' Cc
11 '\x0b' Cc
12 '\x0c' Cc
13 '\r' Cc
28 '\x1c' Cc
29 '\x1d' Cc
30 '\x1e' Cc
31 '\x1f' Cc
32 ' ' Zs
133 '\x85' Cc
160 '\xa0' Zs
5760 '\u1680' Zs
8192 '\u2000' Zs
8193 '\u2001' Zs
8194 '\u2002' Zs
8195 '\u2003' Zs
8196 '\u2004' Zs
8197 '\u2005' Zs
8198 '\u2006' Zs
8199 '\u2007' Zs
8200 '\u2008' Zs
8201 '\u2009' Zs
8202 '\u200a' Zs
8232 '\u2028' Zl
8233 '\u2029' Zp
8239 '\u202f' Zs
8287 '\u205f' Zs
12288 '\u3000' Zs
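
The listing above tests str.isspace() rather than \s itself. A minimal sketch that checks the regex directly and compares the two sets, assuming a str pattern with no flags (names such as re_space and by_category are illustrative):

```python
import re
import sys
import unicodedata
from collections import Counter

ws = re.compile(r"\s")  # str pattern, so Unicode matching is the default

re_space = set()   # code points matched by \s
str_space = set()  # code points for which str.isspace() is true
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    if ws.fullmatch(ch):
        re_space.add(cp)
    if ch.isspace():
        str_space.add(cp)

# True if \s and str.isspace() agree on every code point
print(re_space == str_space)

# Tally the matched code points by Unicode general category
by_category = Counter(unicodedata.category(chr(cp)) for cp in re_space)
print(sorted(by_category.items()))
```

Category alone does not decide membership: most Cc control characters (for example '\x00') are not matched. The str.isspace() documentation defines whitespace as general category Zs or bidirectional class WS, B, or S. The exact set may vary with the Unicode database version bundled with the interpreter.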

nineteendo commented 4 weeks ago

Would someone like to review my pull request?