sg16-unicode / sg16

SG16 overview and general information

Specify what constitutes white-space characters #69

Open tahonermann opened 3 years ago

tahonermann commented 3 years ago

The C++ standard defines behavior that depends on whether a character constitutes white-space, but never defines what those characters are. Uses of the "whitespace" and "white-space" terms appear in:

P2178 proposal 2 sought to clarify the set of characters that constitute white-space and proposed the following set. These characters all satisfy the immutable Pattern_White_Space property (see UAX #44 and/or search for Pattern_White_Space in the UCD).

The above set of characters excludes the following characters that satisfy the (not immutable) White_Space property (see UAX #44 and/or search for White_Space in the UCD).

When addressing this issue, we may want to take the opportunity to replace the existing "whitespace" and "white-space" terminology with "blank space"; ISO guidance may require such a renaming in the future.

tahonermann commented 3 years ago

Actually, the standard does supply a list of whitespace characters in [lex.pptoken]p2:

... Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. ...

and again in [lex.token]p1:

... Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “whitespace”), as described below, are ignored except as they serve to separate tokens.

[Note 1: Some whitespace is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters. — end note]
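For illustration only, the five whitespace characters named in [lex.pptoken]p2 can be written as a small classifier. This is a sketch, not wording from the standard: the function name `is_lex_whitespace_char` is hypothetical, and `'\n'` is assumed to stand in for new-line after source line endings have already been mapped in an earlier translation phase.

```cpp
// Hypothetical helper (not from the standard): true for the whitespace
// characters listed in [lex.pptoken]p2. Comments, which also count as
// whitespace for separating tokens, are not handled here.
constexpr bool is_lex_whitespace_char(char32_t c) {
    switch (c) {
    case U' ':   // space
    case U'\t':  // horizontal tab
    case U'\n':  // new-line (assumed to be already normalized)
    case U'\v':  // vertical tab
    case U'\f':  // form-feed
        return true;
    default:
        return false;
    }
}

static_assert(is_lex_whitespace_char(U' '), "space is listed");
static_assert(is_lex_whitespace_char(U'\v'), "vertical tab is listed");
static_assert(!is_lex_whitespace_char(U'\r'), "carriage return is not listed by name");
static_assert(!is_lex_whitespace_char(0x00A0), "NO-BREAK SPACE is not listed");
```

Note that every character accepted here is ASCII; nothing in the quoted wording extends the set to other Unicode spaces.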

steve-downey commented 3 years ago

Note that 'new-line' there is already a term of art. It possibly includes various combinations of

On Tue, Mar 23, 2021 at 4:14 PM Tom Honermann @.***> wrote:

Actually, the standard does supply a list of whitespace characters in [lex.pptoken]p2 http://eel.is/c++draft/lex#pptoken-2:

... Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. ...

tahonermann commented 3 years ago

P2295 addresses this. The wording in revision 0 proposes a subset of the characters in P2178; it omits:

tahonermann commented 3 years ago

Later revisions of P2295 no longer address this.

cor3ntin commented 3 years ago

P2348, an early draft of which is available at https://isocpp.org/files/papers/D2348R0.pdf, rewords the handling of whitespace and new-lines without extending the set.

tahonermann commented 3 years ago

This issue was discussed on the Unicode.org mailing list. There was a recommendation from a Unicode expert that, for programming languages, Pattern_White_Space may be a useful starting point, but that it might make sense to drop the U+200E and U+200F bidirectional markers and add U+3000 (IDEOGRAPHIC SPACE).
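As a rough sketch of what that recommendation would mean in practice: the first function below encodes the 11 code points that carry Pattern_White_Space in the UCD, and the second applies the suggested adjustment (drop U+200E and U+200F, add U+3000). Both function names are hypothetical, and the adjusted set is only the mailing-list suggestion, not anything adopted by a paper.

```cpp
// Pattern_White_Space as listed in the UCD (PropList.txt): 11 code points.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)  // TAB, LF, VT, FF, CR
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // NEXT LINE
        || c == 0x200E                   // LEFT-TO-RIGHT MARK
        || c == 0x200F                   // RIGHT-TO-LEFT MARK
        || c == 0x2028                   // LINE SEPARATOR
        || c == 0x2029;                  // PARAGRAPH SEPARATOR
}

// Hypothetical adjusted set: Pattern_White_Space without the two
// bidirectional marks, plus U+3000 IDEOGRAPHIC SPACE.
constexpr bool is_adjusted_white_space(char32_t c) {
    if (c == 0x200E || c == 0x200F) {
        return false;
    }
    return is_pattern_white_space(c) || c == 0x3000;
}

static_assert(is_pattern_white_space(0x200E), "LRM has Pattern_White_Space");
static_assert(!is_adjusted_white_space(0x200E), "LRM dropped by the suggestion");
static_assert(!is_pattern_white_space(0x3000), "U+3000 lacks Pattern_White_Space");
static_assert(is_adjusted_white_space(0x3000), "U+3000 added by the suggestion");
```

One design consequence worth noting: Pattern_White_Space is immutable, so the first function never needs to change with new Unicode versions, whereas any hand-adjusted set gives up that guarantee and would need its own stability policy.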

jensmaurer commented 3 years ago

The total feedback was a single response, though.