Open tahonermann opened 6 years ago
To be fair, isupper could be trivially implemented as a test for Unicode's General_Category Lu. Of course, other C/POSIX character classes don't map to Unicode categories that well. There is ISO TR 30112:2014 (draft), which defines what POSIX classes and conversions should do for every Unicode code point, but I'd agree it isn't what a forward-looking library spec should be considering: I'd like a ctype (or a replacement code point classifier) that can tell me if a code point has General_Category Cc rather than if it is "cntrl as interpreted by TR 30112" (which doesn't actually match Cc).
General_Category is the wrong property, I think. Maybe it's ok for Cc (if what you want to test really is C0&C1 control characters), but definitely wrong for isupper. isupper should check Uppercase, which doesn't match gc. Always doubt yourself when you think what you need is General_Category.
Fair point, @rmartinho: TR 30112's definition of isupper includes non-letters with a case, such as Ⓐ.
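To make the General_Category versus Uppercase distinction concrete, here is a minimal sketch. All names in the `sketch` namespace are hypothetical, and the four-row table is hand-copied sample data rather than a real property database; an actual implementation would generate its tables from the Unicode Character Database.

```cpp
namespace sketch {  // hypothetical names, for illustration only

// Tiny subset of General_Category values, enough for the sample rows.
enum class gc { Lu, Ll, So };

struct cp_info {
    char32_t cp;
    gc       category;   // General_Category
    bool     uppercase;  // derived Uppercase property
};

// Hand-copied sample rows (assumption: illustrative only, not a full table).
constexpr cp_info sample[] = {
    {U'A',      gc::Lu, true },  // U+0041 LATIN CAPITAL LETTER A
    {U'a',      gc::Ll, false},  // U+0061 LATIN SMALL LETTER A
    {U'\u0393', gc::Lu, true },  // U+0393 GREEK CAPITAL LETTER GAMMA
    {U'\u24B6', gc::So, true },  // U+24B6 CIRCLED LATIN CAPITAL LETTER A
};

// Linear search is fine for a four-row sample.
inline const cp_info* find(char32_t cp) {
    for (const cp_info& row : sample)
        if (row.cp == cp) return &row;
    return nullptr;
}

// "Is the General_Category Lu?"
inline bool is_gc_lu(char32_t cp) {
    const cp_info* row = find(cp);
    return row != nullptr && row->category == gc::Lu;
}

// "Does the code point have the Uppercase property?"
inline bool has_uppercase(char32_t cp) {
    const cp_info* row = find(cp);
    return row != nullptr && row->uppercase;
}

}  // namespace sketch
```

For Ⓐ the two predicates disagree: `is_gc_lu` is false because its General_Category is So, while `has_uppercase` is true.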
Anyway, to make my comment clearer:
Can't deprecate this, it's used by iostreams.
> Can't deprecate this, it's used by iostreams.
I'm not sure what you are referring to by "this", but deprecation is not removal. We can deprecate features that are still in use.
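As a rough illustration of "deprecation is not removal" at the source level, the sketch below uses the C++14 `[[deprecated]]` attribute. Both function names are hypothetical, and the ASCII range check is only a stand-in for whatever replacement might eventually be specified.

```cpp
namespace sketch {  // hypothetical names, not a proposed API

// Stand-in replacement (assumption: ASCII only, purely illustrative).
inline bool is_upper_replacement(char c) {
    return c >= 'A' && c <= 'Z';
}

// The old entry point stays declared and callable. Code that uses it still
// compiles; the compiler merely emits a diagnostic pointing at the
// replacement.
[[deprecated("use sketch::is_upper_replacement instead")]]
inline bool legacy_is_upper(char c) {
    return is_upper_replacement(c);
}

}  // namespace sketch
```

Existing callers of `legacy_is_upper` (which could include a library's own internals) keep working until the function is actually removed in some later revision.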
Changed title to limit scope. Focus on issues currently identified and described in this issue.
Here is a potential replacement http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1628r0.pdf
Note that this is low level (which doesn't mean it shouldn't be provided, as it is useful/necessary for lexers among other things), but in general Unicode recommends that this kind of thing be done on strings rather than on code points, both in a locale-independent and in a tailored fashion.
Exhaustive list of functions to deprecate
Name | Description |
---|---|
isspace(std::locale) | checks if a character is classified as whitespace by a locale (function template) |
isblank(std::locale)(C++11) | checks if a character is classified as a blank character by a locale (function template) |
iscntrl(std::locale) | checks if a character is classified as a control character by a locale (function template) |
isupper(std::locale) | checks if a character is classified as uppercase by a locale (function template) |
islower(std::locale) | checks if a character is classified as lowercase by a locale (function template) |
isalpha(std::locale) | checks if a character is classified as alphabetic by a locale (function template) |
isdigit(std::locale) | checks if a character is classified as a digit by a locale (function template) |
ispunct(std::locale) | checks if a character is classified as punctuation by a locale (function template) |
isxdigit(std::locale) | checks if a character is classified as a hexadecimal digit by a locale (function template) |
isalnum(std::locale) | checks if a character is classified as alphanumeric by a locale (function template) |
isprint(std::locale) | checks if a character is classified as printable by a locale (function template) |
isgraph(std::locale) | checks if a character is classified as graphical by a locale (function template) |
toupper(std::locale) | converts a character to uppercase using the ctype facet of a locale (function template) |
tolower(std::locale) | converts a character to lowercase using the ctype facet of a locale (function template) |
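For readers less familiar with these templates: each one forwards to the `std::ctype` facet of the locale it is given. The sketch below spells that forwarding out by hand for `char`. The `*_via_facet` helper names are hypothetical; the `std::use_facet`, `std::ctype<char>::is`, and `std::ctype<char>::toupper` calls are the standard ones.

```cpp
#include <locale>

namespace sketch {  // hypothetical helper names

// Roughly what std::isupper(c, loc) does for char.
inline bool isupper_via_facet(char c, const std::locale& loc) {
    return std::use_facet<std::ctype<char>>(loc).is(std::ctype_base::upper, c);
}

// Roughly what std::toupper(c, loc) does for char.
inline char toupper_via_facet(char c, const std::locale& loc) {
    return std::use_facet<std::ctype<char>>(loc).toupper(c);
}

}  // namespace sketch
```

Because the facet works on one `char` (or `wchar_t`) at a time, a multi-byte character such as a UTF-8 encoded Γ cannot be classified through this interface.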
> Exhaustive list of functions to deprecate

Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.
> Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.
Good question
An alternative might be to add deprecated (or deleted, though that might be a hard sell?) overloads for char8_t, char16_t, char32_t.
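A minimal sketch of what the deleted-overload idea could look like, written against a hypothetical wrapper namespace rather than `std` itself. The `accepts` trait is likewise an assumption, added only to show the effect at compile time. The `char8_t` overload is left as a comment so the sketch also builds before C++20.

```cpp
#include <locale>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical wrapper, not a proposed API

// Forwarding template for character types that have a ctype facet.
template <class CharT>
bool isupper(CharT c, const std::locale& loc) {
    return std::isupper(c, loc);
}

// Exact-match non-template overloads are preferred over the template, so
// calls with these types select a deleted function and fail to compile.
bool isupper(char16_t, const std::locale&) = delete;
bool isupper(char32_t, const std::locale&) = delete;
// C++20 and later: bool isupper(char8_t, const std::locale&) = delete;

// Detects whether sketch::isupper(T, locale) is a well-formed call.
template <class T, class = void>
struct accepts : std::false_type {};

template <class T>
struct accepts<T, decltype(void(sketch::isupper(
                      std::declval<T>(),
                      std::declval<const std::locale&>())))>
    : std::true_type {};

}  // namespace sketch
```

With this in place `sketch::isupper('A', loc)` still compiles, while `sketch::isupper(U'Γ', loc)` is rejected at compile time. The unmodified `std::isupper` template accepts the latter and then typically fails at run time, since a `std::ctype<char32_t>` facet is not required to exist.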
> An alternative might be to add deprecated (or deleted, though that might be a hard sell?) overloads for char8_t, char16_t, char32_t.
I think we should focus more on what an appropriate C replacement would look like first.
Would C be interested in supporting unicode character properties?
> Would C be interested in supporting unicode character properties?
No idea. My guess is that they would require any replacements to work with (wide) execution encoding and thus existing (non-Unicode) encodings. I see the motivation for replacement being:
- improved error handling; no EOF value handling.
- no UB on values not representable in unsigned char.
- not code unit value based so that variable length encodings can be supported.

The point is more that we can’t deprecate these (in C) without replacements (in C).
The thing is - I'm pretty sure Unicode character properties are NOT a replacement.
Unicode character properties should NOT be locale dependent in any way: cp_isupper(U'Γ') should always be true, regardless of the execution encoding, platform, etc.
Ignoring the fact that isupper(foo) (for example) does not support anything but the first 255 values of a given character set, a negative answer means either
- foo is not an upper case letter
- foo is not part of this non-Unicode character set

Is that useful information? Is a replacement useful? If it is, we still need two APIs, and maybe we can fix the existing one by fixing your second and third bullet points.
> The thing is - I'm pretty sure Unicode character properties are NOT a replacement.
I agree. They might be used in the implementation of a replacement though.
> Unicode character properties should NOT be locale dependent in any way: cp_isupper(U'Γ') should always be true, regardless of the execution encoding, platform, etc.
I agree, but the provided example is specifically passing a Unicode code point, so I don't think anyone would expect a locale dependency (this is not true for case mapping algorithms in general, but is for Unicode code point properties).
> Ignoring the fact that isupper(foo) (for example) does not support anything but the first 255 values of a given character set

Technically, it supports all values that fit in a value of unsigned char (which is usually 8-bit in practice).
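The `unsigned char` constraint is the source of a well-known pitfall with the `<cctype>` functions: the argument must be representable as `unsigned char` or equal `EOF`, so passing a plain `char` with a negative value is undefined behavior. A minimal sketch of the usual workaround, with a hypothetical helper name:

```cpp
#include <cctype>

namespace sketch {  // hypothetical helper name

// Convert to unsigned char before calling the <cctype> function so that
// bytes >= 0x80 are never passed as negative ints.
inline bool is_upper_byte(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

}  // namespace sketch
```

The cast makes the call well defined, but it does not make it meaningful for variable-length encodings: `'\xCE'`, the first UTF-8 code unit of Γ, is still classified on its own as a single byte.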
> a negative answer means either
> - foo is not a upper case letter
> - foo is not part of this non-unicode character set
Or foo isn't a code point at all (e.g., a trailing code unit value).
A code point based interface would solve all three of the bullet points I listed. (I would be fine with passing an invalid code point, errm, scalar value being a precondition violation; long live Contracts 2.0!)
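One way a code point based interface could separate "not uppercase" from "not a valid input at all" is sketched below. This is only one possible design: it reports invalid input through a third result value instead of treating it as a precondition violation. Every name is hypothetical, and `placeholder_uppercase` covers ASCII only, standing in for a real Uppercase property lookup.

```cpp
namespace sketch {  // hypothetical names, one possible design

enum class upper_result { yes, no, not_a_scalar_value };

// Unicode scalar values are U+0000..U+10FFFF excluding the surrogate
// range U+D800..U+DFFF.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Placeholder for a real property lookup (assumption: ASCII only).
constexpr bool placeholder_uppercase(char32_t cp) {
    return cp >= U'A' && cp <= U'Z';
}

// Distinguishes a negative answer from an input that is not a scalar value.
constexpr upper_result classify_upper(char32_t cp) {
    return !is_scalar_value(cp)      ? upper_result::not_a_scalar_value
         : placeholder_uppercase(cp) ? upper_result::yes
                                     : upper_result::no;
}

}  // namespace sketch
```

Since the parameter is a whole code point, there is no `EOF` sentinel and no code unit ambiguity; a lone surrogate such as `0xD800` is reported as `not_a_scalar_value` rather than silently answered with "no".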
The standard library specifies a number of interfaces that cannot be made to work reasonably well for Unicode. For example, from <locale>:
- std::ctype, std::ctype_byname
- std::isupper()
- std::toupper()

Such interfaces are candidates for deprecation, replacement, and eventual removal.