sg16-unicode / sg16

SG16 overview and general information
Extend whitespace to include NEL, LS, PS, LRM, RLM, and maybe ALM #74

Open tahonermann opened 2 years ago

tahonermann commented 2 years ago

Unicode paper L2/22-072R: Proposal for amendments to UAX#9 and UAX#31, adopted for the upcoming Unicode 15 release, demonstrates the utility of allowing U+200E LEFT-TO-RIGHT MARK (LRM) and U+200F RIGHT-TO-LEFT MARK (RLM) to appear in whitespace without letting either mark constitute whitespace on its own. The intent is to allow these marks to be inserted in whitespace in order to restore character directionality that might have been altered by characters in the preceding token.
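
One possible reading of "appear in whitespace, but not constitute whitespace in isolation" can be sketched as a lexer helper. This is only an illustration under stated assumptions: the function names are hypothetical, the set of ordinary whitespace characters is simplified to space, tab, and the usual line-break controls, and neither the paper nor the C++ standard prescribes this exact rule.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: the two implicit directional marks discussed above.
constexpr bool is_directional_mark(char32_t c) {
    return c == U'\u200E' /* LRM */ || c == U'\u200F' /* RLM */;
}

// Assumption: a simplified set of "ordinary" whitespace for illustration.
constexpr bool is_ordinary_whitespace(char32_t c) {
    return c == U' ' || c == U'\t' || c == U'\n' ||
           c == U'\v' || c == U'\f' || c == U'\r';
}

// Returns the length of the whitespace run at the start of `s`.
// LRM and RLM may appear anywhere in the run, but a run made up only
// of marks is not treated as whitespace, so 0 is returned for it.
constexpr std::size_t whitespace_run_length(std::u32string_view s) {
    std::size_t n = 0;
    bool saw_ordinary = false;
    while (n < s.size() &&
           (is_ordinary_whitespace(s[n]) || is_directional_mark(s[n]))) {
        saw_ordinary = saw_ordinary || is_ordinary_whitespace(s[n]);
        ++n;
    }
    return saw_ordinary ? n : 0;
}
```

With this sketch, a space followed by LRM and another space is consumed as a single run of three code points, while an LRM sitting directly between two tokens is not consumed as whitespace at all.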

tahonermann commented 2 years ago

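I updated the issue title to extend this issue to cover the inclusion of all of the following characters in whitespace: U+0085 NEXT LINE (NEL), U+2028 LINE SEPARATOR (LS), U+2029 PARAGRAPH SEPARATOR (PS), U+200E LEFT-TO-RIGHT MARK (LRM), and U+200F RIGHT-TO-LEFT MARK (RLM). This would suffice for C++ to meet the Pattern_White_Space requirements of UAX31-R3.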

Additionally, inclusion of U+061C ARABIC LETTER MARK (ALM) should be considered, as it is conceptually similar to LRM and RLM. However, it is not a member of the Pattern_White_Space property and cannot be added, because that property is immutable. Including this character in whitespace would therefore require the specification of a profile in [uaxid.pattern] for conformance with UAX31-R3.
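
For reference, the membership of Pattern_White_Space can be written down as a small predicate. This is a minimal sketch, not proposed wording: the function name is hypothetical, and the code point list reflects the property as published by Unicode (U+0009 through U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, U+2029).

```cpp
// Hypothetical helper: true if `c` has the Pattern_White_Space property.
// The property is immutable, so this list is not expected to change.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)  // TAB, LF, VT, FF, CR
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // NEXT LINE (NEL)
        || c == 0x200E                   // LEFT-TO-RIGHT MARK (LRM)
        || c == 0x200F                   // RIGHT-TO-LEFT MARK (RLM)
        || c == 0x2028                   // LINE SEPARATOR (LS)
        || c == 0x2029;                  // PARAGRAPH SEPARATOR (PS)
}
```

Note that U+061C (ALM) is not matched by this predicate, which is why accepting it as whitespace would need a profile rather than plain conformance.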