Specify what constitutes white-space characters

tahonermann commented 3 years ago

The C++ standard defines behavior that depends on whether a character constitutes white-space, but never defines what those characters are. Uses of the "whitespace" and "white-space" terms appear in:

P2178 proposal 2 sought to clarify the set of characters that constitute white-space and proposed the following set. These characters all satisfy the immutable Pattern_White_Space property (see UAX #44 and/or search for Pattern_White_Space in the UCD).

U+0009: CHARACTER TABULATION
U+000A: LINE FEED (LF)
U+000B: LINE TABULATION
U+000C: FORM FEED (FF)
U+000D: CARRIAGE RETURN (CR)
U+0020: SPACE
U+0085: NEXT LINE (NEL)
U+200E: LEFT-TO-RIGHT MARK
U+200F: RIGHT-TO-LEFT MARK
U+2028: LINE SEPARATOR
U+2029: PARAGRAPH SEPARATOR
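
For illustration, the proposed set can be written as a membership test. This is only a sketch: the function name `is_pattern_white_space` is hypothetical and comes from neither P2178 nor the standard, and the code points are hard-coded from the list above rather than read from the UCD.

```cpp
// Sketch only: hard-codes the eleven code points listed above.
// The name is_pattern_white_space is hypothetical (an assumption for this example).
constexpr bool is_pattern_white_space(char32_t c) noexcept {
    switch (c) {
    case 0x0009: // CHARACTER TABULATION
    case 0x000A: // LINE FEED (LF)
    case 0x000B: // LINE TABULATION
    case 0x000C: // FORM FEED (FF)
    case 0x000D: // CARRIAGE RETURN (CR)
    case 0x0020: // SPACE
    case 0x0085: // NEXT LINE (NEL)
    case 0x200E: // LEFT-TO-RIGHT MARK
    case 0x200F: // RIGHT-TO-LEFT MARK
    case 0x2028: // LINE SEPARATOR
    case 0x2029: // PARAGRAPH SEPARATOR
        return true;
    default:
        return false;
    }
}
```

Because the property is immutable, a hard-coded table like this would not need to change with new Unicode versions.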

The above set of characters excludes the following characters that satisfy the (not immutable) White_Space property (see UAX #44 and/or search for White_Space in the UCD).

U+00A0: NO-BREAK SPACE
U+1680: OGHAM SPACE MARK
U+2000: EN QUAD
U+2001: EM QUAD
U+2002: EN SPACE
U+2003: EM SPACE
U+2004: THREE-PER-EM SPACE
U+2005: FOUR-PER-EM SPACE
U+2006: SIX-PER-EM SPACE
U+2007: FIGURE SPACE
U+2008: PUNCTUATION SPACE
U+2009: THIN SPACE
U+200A: HAIR SPACE
U+202F: NARROW NO-BREAK SPACE
U+205F: MEDIUM MATHEMATICAL SPACE
U+3000: IDEOGRAPHIC SPACE
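
A matching sketch for the excluded characters, again with a hypothetical function name (`is_excluded_white_space`) and with code points copied from the list above. It reflects only the characters named here; since White_Space is not immutable, a real implementation would presumably consult the UCD for the Unicode version in use.

```cpp
// Sketch only: true for the White_Space characters listed above that are
// NOT part of the Pattern_White_Space set. The function name is hypothetical.
constexpr bool is_excluded_white_space(char32_t c) noexcept {
    return c == 0x00A0                     // NO-BREAK SPACE
        || c == 0x1680                     // OGHAM SPACE MARK
        || (c >= 0x2000 && c <= 0x200A)    // EN QUAD .. HAIR SPACE
        || c == 0x202F                     // NARROW NO-BREAK SPACE
        || c == 0x205F                     // MEDIUM MATHEMATICAL SPACE
        || c == 0x3000;                    // IDEOGRAPHIC SPACE
}
```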

When addressing this issue, we may want to take the opportunity to replace the existing "whitespace" and "white-space" terminology with "blank space"; ISO guidance may require such a renaming in the future.

tahonermann commented 3 years ago

Actually, the standard does supply a list of whitespace characters in [lex.pptoken]p2:

... Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. ...

and again in [lex.token]p1:

... Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “whitespace”), as described below, are ignored except as they serve to separate tokens.

[Note 1: Some whitespace is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters. — end note]
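
As a rough sketch, the five whitespace characters named in [lex.pptoken]p2 map onto a predicate like the one below. The name `is_lex_whitespace` is hypothetical, and the code assumes new-line is represented as `'\n'` after translation. Comments, which the standard also counts as whitespace, are not handled.

```cpp
// Sketch only: the whitespace characters enumerated in [lex.pptoken]p2.
// Assumes new-line is represented as '\n'; the function name is hypothetical.
constexpr bool is_lex_whitespace(char c) noexcept {
    return c == ' '     // space
        || c == '\t'    // horizontal tab
        || c == '\n'    // new-line
        || c == '\v'    // vertical tab
        || c == '\f';   // form-feed
}
```

Note that this differs from `std::isspace` in the default "C" locale, which also classifies `'\r'` as whitespace.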

steve-downey commented 3 years ago

Note that 'new-line' there is already a term of art. It possibly covers various combinations of:

U+000A: LINE FEED (LF)
U+000D: CARRIAGE RETURN (CR)
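
One way to picture "various combinations" is a helper that recognizes a line ending at the start of a buffer. This is a sketch under the assumption that CR LF, a lone CR, and a lone LF are each treated as one new-line. Which sequences an implementation actually accepts is not stated in this thread, and the name `new_line_length` is hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Sketch only (requires C++17). Returns how many code points at the start of
// `text` form a single line ending: 2 for CR LF, 1 for a lone CR or a lone LF,
// and 0 if `text` does not start with a line ending.
// The set of accepted sequences is an assumption made for this example.
constexpr std::size_t new_line_length(std::u32string_view text) noexcept {
    if (text.empty()) {
        return 0;
    }
    if (text[0] == U'\r') {
        // Treat CR LF as one line ending rather than two.
        return (text.size() > 1 && text[1] == U'\n') ? 2 : 1;
    }
    return text[0] == U'\n' ? 1 : 0;
}
```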


tahonermann commented 3 years ago

P2295 addresses this. The wording in revision 0 proposes a subset of the characters in P2178; it omits:

U+200E: LEFT-TO-RIGHT MARK
U+200F: RIGHT-TO-LEFT MARK

tahonermann commented 3 years ago

Later revisions of P2295 no longer address this.

cor3ntin commented 3 years ago

P2348, an early draft of which is available at https://isocpp.org/files/papers/D2348R0.pdf, rewords the handling of whitespace and new lines without extending the set.

tahonermann commented 3 years ago

This issue was discussed on the Unicode.org mailing list. There was a recommendation from a Unicode expert that, for programming languages, Pattern_White_Space may be a useful starting point, but that it might make sense to drop the U+200E and U+200F bidirectional markers and add U+3000 (IDEOGRAPHIC SPACE).
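
To make the suggestion concrete, here is a sketch of the resulting set: Pattern_White_Space without U+200E and U+200F, plus U+3000. The function name `is_suggested_white_space` is hypothetical. This illustrates the mailing list recommendation only and is not proposed wording.

```cpp
// Sketch only: Pattern_White_Space minus the two bidirectional marks,
// plus IDEOGRAPHIC SPACE. The function name is hypothetical.
constexpr bool is_suggested_white_space(char32_t c) noexcept {
    return (c >= 0x0009 && c <= 0x000D)    // TAB, LF, VT, FF, CR
        || c == 0x0020                     // SPACE
        || c == 0x0085                     // NEXT LINE (NEL)
        || c == 0x2028                     // LINE SEPARATOR
        || c == 0x2029                     // PARAGRAPH SEPARATOR
        || c == 0x3000;                    // IDEOGRAPHIC SPACE (added)
    // U+200E and U+200F are intentionally absent (dropped).
}
```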

jensmaurer commented 3 years ago

The total feedback was a single response, though.

sg16-unicode / sg16

Specify what constitutes white-space characters #69